Abstract


We show that natural classes of regularized learning algorithms with a form of recency bias achieve faster convergence rates to approximate efficiency and to coarse correlated equilibria in multiplayer normal form games. When each player in a game uses an algorithm from our class, their individual regret decays at $O(T^{-3/4})$, while the sum of utilities converges to an approximate optimum at $O(T^{-1})$, an improvement upon the worst case $O(T^{-1/2})$ rates. We show a black-box reduction for any algorithm in the class to achieve $\tilde{O}(T^{-1/2})$ rates against an adversary, while maintaining the faster rates against algorithms in the class. Our results extend those of Rakhlin and Sridharan [?] and Daskalakis et al. [?], who only analyzed two-player zero-sum games for specific algorithms.

1 Introduction 

What happens when players in a game interact with one another, all of them acting independently and selfishly to maximize their own utilities? If they are smart, we intuitively expect their utilities, both individually and as a group, to grow, perhaps even to approach the best possible. We also expect the dynamics of their behavior to eventually reach some kind of equilibrium. Understanding these dynamics is central to game theory as well as its various application areas, including economics, network routing, auction design, and evolutionary biology.

It is natural in this setting for the players to each make use of a no-regret learning algorithm for making their decisions, an approach known as decentralized no-regret dynamics. No-regret algorithms are a strong match for playing games because their regret bounds hold even in adversarial environments. As a benefit, these bounds ensure that each player’s utility approaches optimality. When these algorithms are played against one another, one can also show that the sum of utilities approaches an approximate optimum [?], and that the player strategies converge to an equilibrium under appropriate conditions [?], at rates governed by the regret bounds. Well-known families of no-regret algorithms include multiplicative weights [?], Mirror Descent [?], and Follow the Regularized/Perturbed Leader [?]. (See [?] for excellent overviews.) For all of these, the average regret vanishes at the worst-case rate of $O(1/\sqrt{T})$, which is unimprovable in fully adversarial scenarios.

However, the players in our setting are facing other similar, predictable no-regret learning algorithms, a chink that hints at the possibility of improved convergence rates for such dynamics. This was first observed and exploited by Daskalakis et al. [?]. For two-player zero-sum games, they developed a decentralized variant of Nesterov’s accelerated saddle point algorithm [?] and showed that each player’s average regret converges at the remarkable rate of $O(1/T)$. Although the resulting dynamics are somewhat unnatural, in later work Rakhlin and Sridharan [?] showed, surprisingly, that the same convergence rate holds for a simple variant of Mirror Descent with the seemingly minor modification that the last utility observation is counted twice.

Although major steps forward, both these works are limited to two-player zero-sum games, the very 
simplest case. As such, they do not cover many practically important settings, such as auctions or 
routing games, which are decidedly not zero-sum, and which involve many independent actors. 

In this paper, we vastly generalize these techniques to the practically important but far more challenging case of arbitrary multi-player normal-form games, giving natural no-regret dynamics whose convergence rates are much faster than previously possible for this general setting.

Contributions. We show that the average welfare of the game, that is, the sum of player utilities, converges to approximately optimal welfare at the rate $O(1/T)$, rather than the previously known rate of $O(1/\sqrt{T})$. Concretely, we show a natural class of regularized no-regret algorithms with recency bias that achieve welfare at least $(\lambda/(1+\mu))\,\mathrm{OPT} - O(1/T)$, where $\lambda$ and $\mu$ are parameters in a smoothness condition on the game introduced by Roughgarden [?]. For the same class of algorithms, we show that each individual player’s average regret converges to zero at the rate $O(T^{-3/4})$.

Thus, our results entail an algorithm for computing coarse correlated equilibria in a decentralized 
manner with significantly faster convergence than existing methods. 

We additionally give a black-box reduction that preserves the fast rates in favorable environments, while robustly maintaining $O(1/\sqrt{T})$ regret against any opponent in the worst case.

Even for two-person zero-sum games, our results for general games expose a hidden generality and modularity underlying the previous results [?]. First, our analysis identifies stability and recency bias as key structural ingredients of an algorithm with fast rates. This covers the Optimistic Mirror Descent of Rakhlin and Sridharan [?] as an example, but also applies to optimistic variants of Follow the Regularized Leader (FTRL), including dependence on arbitrary weighted windows in the history as opposed to just the utility from the last round. Recency bias is a behavioral pattern commonly observed in game-theoretic environments [?]; as such, our results can be viewed as a partial theoretical justification. Second, previous approaches in [?] to achieving both faster convergence against similar algorithms and $O(1/\sqrt{T})$ regret rates against adversaries relied on ad-hoc modifications of specific algorithms. We give a black-box modification which is not algorithm specific and works for all these optimistic algorithms.

Finally, we simulate a 4-bidder simultaneous auction game, and compare our optimistic algorithms against Hedge [?] in terms of utilities, regrets and convergence to equilibria.

2 Repeated Game Model and Dynamics 

Consider a static game $G$ among a set $N$ of $n$ players. Each player $i$ has a strategy space $S_i$ and a utility function $u_i : S_1 \times \cdots \times S_n \to [0, 1]$ that maps a strategy profile $\mathbf{s} = (s_1, \ldots, s_n)$ to a utility $u_i(\mathbf{s})$. We assume that the strategy space of each player is finite and has cardinality $d$, i.e. $|S_i| = d$. We denote with $\mathbf{w} = (\mathbf{w}_1, \ldots, \mathbf{w}_n)$ a profile of mixed strategies, where $\mathbf{w}_i \in \Delta(S_i)$ and $w_{i,x}$ is the probability of strategy $x \in S_i$. Finally, let $U_i(\mathbf{w}) = \mathbb{E}_{\mathbf{s}\sim\mathbf{w}}[u_i(\mathbf{s})]$, the expected utility of player $i$.

We consider the setting where the game $G$ is played repeatedly for $T$ time steps. At each time step $t$ each player $i$ picks a mixed strategy $\mathbf{w}_i^t \in \Delta(S_i)$. At the end of the iteration each player $i$ observes the expected utility he would have received had he played any possible strategy $x \in S_i$. More formally, let $u_{i,x}^t = \mathbb{E}_{\mathbf{s}_{-i}\sim\mathbf{w}_{-i}^t}[u_i(x, \mathbf{s}_{-i})]$, where $\mathbf{s}_{-i}$ is the set of strategies of all but the $i$-th player, and let $\mathbf{u}_i^t = (u_{i,x}^t)_{x\in S_i}$. At the end of each iteration each player $i$ observes $\mathbf{u}_i^t$. Observe that the expected utility of a player at iteration $t$ is simply the inner product $\langle \mathbf{w}_i^t, \mathbf{u}_i^t \rangle$.

No-regret dynamics. We assume that the players each decide their strategy $\mathbf{w}_i^t$ based on a vanishing regret algorithm. Formally, for each player $i$, the regret after $T$ time steps is equal to the maximum gain he could have achieved by switching to any other fixed strategy:

$$r_i(T) = \sup_{\mathbf{w}_i^* \in \Delta(S_i)} \sum_{t=1}^{T} \langle \mathbf{w}_i^* - \mathbf{w}_i^t, \mathbf{u}_i^t \rangle.$$

The algorithm has vanishing regret if $r_i(T) = o(T)$.
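Because the objective in the definition of $r_i(T)$ is linear in $\mathbf{w}_i^*$, the supremum over $\Delta(S_i)$ is attained at a pure strategy. A minimal sketch of this computation, assuming one player's history is stored as plain Python lists (the name `regret` is ours, for illustration only):

```python
from typing import Sequence


def regret(ws: Sequence[Sequence[float]], us: Sequence[Sequence[float]]) -> float:
    """r_i(T) = sup_{w*} sum_t <w* - w^t, u^t> for a single player.

    ws[t] is the mixed strategy w_i^t and us[t] the utility vector u_i^t.
    The objective is linear in w*, so the sup over the simplex is attained
    at a pure strategy (a vertex).
    """
    d = len(us[0])
    realized = sum(sum(w[x] * u[x] for x in range(d)) for w, u in zip(ws, us))
    best_fixed = max(sum(u[x] for u in us) for x in range(d))
    return best_fixed - realized
```

For example, playing uniformly for two rounds while strategy 0 always pays 1 gives regret 1.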


Approximate Efficiency of No-Regret Dynamics. We are interested in analyzing the average welfare of such vanishing regret sequences. For a given strategy profile $\mathbf{s}$ the social welfare is defined as the sum of the player utilities: $W(\mathbf{s}) = \sum_{i\in N} u_i(\mathbf{s})$. We overload notation to denote $W(\mathbf{w}) = \mathbb{E}_{\mathbf{s}\sim\mathbf{w}}[W(\mathbf{s})]$. We want to lower bound the average welfare of the sequence relative to the optimal welfare of the static game:

$$\mathrm{OPT} = \max_{\mathbf{s}\in S_1\times\cdots\times S_n} W(\mathbf{s}).$$

This is the optimal welfare achievable in the absence of player incentives, if a central coordinator could dictate each player’s strategy. We next define a class of games first identified by Roughgarden [?] on which we can approximate the optimal welfare using decoupled no-regret dynamics.

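Definition 1 (Smooth game [?]). A game is $(\lambda, \mu)$-smooth if there exists a strategy profile $\mathbf{s}^*$ such that for any strategy profile $\mathbf{s}$: $\sum_{i\in N} u_i(s_i^*, \mathbf{s}_{-i}) \ge \lambda\,\mathrm{OPT} - \mu\, W(\mathbf{s})$.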


In words, any player using his optimal strategy continues to do well irrespective of other players’ 
strategies. This condition directly implies near-optimality of no-regret dynamics as we show below. 

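Proposition 2. In a $(\lambda, \mu)$-smooth game, if each player $i$ suffers regret at most $r_i(T)$, then:

$$\frac{1}{T}\sum_{t=1}^{T} W(\mathbf{w}^t) \ge \frac{\lambda}{1+\mu}\,\mathrm{OPT} - \frac{1}{1+\mu}\cdot\frac{1}{T}\sum_{i\in N} r_i(T) = \frac{1}{\rho}\,\mathrm{OPT} - \frac{1}{1+\mu}\cdot\frac{1}{T}\sum_{i\in N} r_i(T),$$

where the factor $\rho = (1+\mu)/\lambda$ is called the price of anarchy (PoA).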


This proposition is essentially a more explicit version of Roughgarden’s result [?]; we provide a proof in the appendix for completeness. The result shows that the convergence to PoA is driven by the quantity $\frac{1}{1+\mu}\cdot\frac{1}{T}\sum_{i\in N} r_i(T)$. There are many algorithms which achieve a regret rate of $r_i(T) = O(\sqrt{\log(d)\,T})$, in which case the proposition implies that the average welfare converges to PoA at a rate of $O(n\sqrt{\log(d)/T})$. As we will show, for some natural classes of no-regret algorithms the average welfare converges at the much faster rate of $O(n^2\log(d)/T)$.
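To make the dependence on $\sum_i r_i(T)$ concrete, the following hypothetical helpers evaluate the right-hand side of Proposition 2 and the PoA factor $\rho$. The numbers in the usage loop are made up for illustration and are not results from the paper; they only contrast per-player regret growing like $\sqrt{T}$ with constant regret.

```python
def welfare_lower_bound(opt, lam, mu, regrets, T):
    """Right-hand side of Proposition 2:
    lam/(1+mu) * OPT - (1/(1+mu)) * (1/T) * sum_i r_i(T)."""
    return lam / (1 + mu) * opt - sum(regrets) / ((1 + mu) * T)


def price_of_anarchy(lam, mu):
    """rho = (1 + mu) / lam."""
    return (1 + mu) / lam


# Hypothetical numbers: 4 players, a (0.5, 0.5)-smooth game, OPT = 1.
# With per-player regret 2*sqrt(T) the gap to OPT/rho shrinks like 1/sqrt(T);
# with constant per-player regret it shrinks like 1/T.
for T in (100, 10_000):
    slow = welfare_lower_bound(1.0, 0.5, 0.5, [2 * T ** 0.5] * 4, T)
    fast = welfare_lower_bound(1.0, 0.5, 0.5, [2.0] * 4, T)
    print(T, round(slow, 4), round(fast, 4))
```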


3 Fast Convergence to Approximate Efficiency 

In this section, we present our main theoretical results characterizing a class of no-regret dynamics 
which lead to faster convergence in smooth games. We begin by describing this class. 

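Definition 3 (RVU property). We say that a vanishing regret algorithm satisfies the Regret bounded by Variation in Utilities (RVU) property with parameters $\alpha > 0$ and $0 < \beta \le \gamma$ and a pair of dual norms $(\|\cdot\|, \|\cdot\|_*)$¹ if its regret on any sequence of utilities $\mathbf{u}^1, \mathbf{u}^2, \ldots, \mathbf{u}^T$ is bounded as

$$\sum_{t=1}^{T}\langle \mathbf{w}^* - \mathbf{w}^t, \mathbf{u}^t\rangle \le \alpha + \beta\sum_{t=1}^{T}\|\mathbf{u}^t - \mathbf{u}^{t-1}\|_*^2 - \gamma\sum_{t=1}^{T}\|\mathbf{w}^t - \mathbf{w}^{t-1}\|^2. \tag{1}$$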
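The right-hand side of (1) can be evaluated directly on a recorded trajectory. A minimal sketch, assuming the $\ell_1$ primal norm and $\ell_\infty$ dual norm used later in Theorem 4, and assuming the caller supplies the $t = 0$ entries $\mathbf{u}^0, \mathbf{w}^0$ (the text does not fix them here); `rvu_rhs` is a hypothetical name:

```python
def rvu_rhs(alpha, beta, gamma, us, ws):
    """Right-hand side of the RVU bound (1) with ||.|| = l1 and ||.||_* = l_inf.

    us = [u^0, u^1, ..., u^T] and ws = [w^0, w^1, ..., w^T]; the t = 0
    entries are supplied by the caller (an assumption of this sketch).
    """
    du = sum(max(abs(a - b) for a, b in zip(us[t], us[t - 1])) ** 2
             for t in range(1, len(us)))
    dw = sum(sum(abs(a - b) for a, b in zip(ws[t], ws[t - 1])) ** 2
             for t in range(1, len(ws)))
    return alpha + beta * du - gamma * dw
```

The negative $\gamma$ term is what rewards stable strategies; vanilla methods lack it.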


Typical online learning algorithms such as Mirror Descent and FTRL do not satisfy the RVU property in their vanilla form, as the middle term grows as $\sum_{t=1}^{T}\|\mathbf{u}^t\|_*^2$ for these methods. However, Rakhlin and Sridharan [?] give a modification of Mirror Descent with this property, and we will present a similar variant of FTRL in the sequel.

We now present two sets of results when each player uses an algorithm with this property. The 
first discusses the convergence of social welfare, while the second governs the convergence of the 
individual players’ utilities at a fast rate. 


¹The dual to a norm $\|\cdot\|$ is defined as $\|\mathbf{v}\|_* = \sup_{\|\mathbf{u}\|\le 1}\langle \mathbf{u}, \mathbf{v}\rangle$.
3.1 Fast Convergence of Social Welfare

Given Proposition 2, we only need to understand the evolution of the sum of players’ regrets $\sum_{i\in N} r_i(T)$ in order to obtain convergence rates of the social welfare. Our main result in this section bounds this sum when each player uses dynamics with the RVU property.

Theorem 4. Suppose that the algorithm of each player $i$ satisfies the property RVU with parameters $\alpha$, $\beta$ and $\gamma$ such that $\beta \le \gamma/(n-1)^2$ and $\|\cdot\| = \|\cdot\|_1$. Then $\sum_{i\in N} r_i(T) \le \alpha n$.


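Proof. Since $u_i(\mathbf{s}) \le 1$, the definitions imply:

$$\|\mathbf{u}_i^t - \mathbf{u}_i^{t-1}\|_* \le \sum_{\mathbf{s}_{-i}}\Big|\prod_{j\ne i} w_{j,s_j}^t - \prod_{j\ne i} w_{j,s_j}^{t-1}\Big|.$$

The latter is the total variation distance of two product distributions. By known properties of total variation (see e.g. [?]), this is bounded by the sum of the total variations of each marginal distribution:

$$\sum_{\mathbf{s}_{-i}}\Big|\prod_{j\ne i} w_{j,s_j}^t - \prod_{j\ne i} w_{j,s_j}^{t-1}\Big| \le \sum_{j\ne i}\|\mathbf{w}_j^t - \mathbf{w}_j^{t-1}\| \tag{2}$$

By Jensen’s inequality, $\big(\sum_{j\ne i}\|\mathbf{w}_j^t - \mathbf{w}_j^{t-1}\|\big)^2 \le (n-1)\sum_{j\ne i}\|\mathbf{w}_j^t - \mathbf{w}_j^{t-1}\|^2$, so that

$$\sum_{i\in N}\|\mathbf{u}_i^t - \mathbf{u}_i^{t-1}\|_*^2 \le (n-1)\sum_{i\in N}\sum_{j\ne i}\|\mathbf{w}_j^t - \mathbf{w}_j^{t-1}\|^2 = (n-1)^2\sum_{i\in N}\|\mathbf{w}_i^t - \mathbf{w}_i^{t-1}\|^2.$$

The theorem follows by summing up the RVU property (1) for each player $i$ and observing that, since $\beta \le \gamma/(n-1)^2$, the summation of the second terms is smaller than that of the third terms and thereby can be dropped. ■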
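The key step of the proof, $\|\mathbf{u}_i^t - \mathbf{u}_i^{t-1}\|_\infty \le \sum_{j\ne i}\|\mathbf{w}_j^t - \mathbf{w}_j^{t-1}\|_1$, can be sanity-checked numerically. The sketch below draws random utility tables in $[0, 1]$ for player 0 of a small game and random pairs of mixed-strategy profiles for the other players; the game sizes, seed, and function names are illustrative assumptions, and a randomized check is of course not a proof.

```python
import itertools
import random


def utility_vector(u0, others, d):
    """u_i^t for player 0: expected utility of each pure x against the product
    distribution given by `others` (mixed strategies of players 1..n-1)."""
    vec = []
    for x in range(d):
        total = 0.0
        for s_minus in itertools.product(range(d), repeat=len(others)):
            prob = 1.0
            for w_j, s_j in zip(others, s_minus):
                prob *= w_j[s_j]
            total += prob * u0(x, s_minus)
        vec.append(total)
    return vec


def max_violation(n=3, d=3, trials=200, seed=0):
    """Largest value of ||u^t - u^{t-1}||_inf - sum_j ||w_j^t - w_j^{t-1}||_1
    over random utility tables and random strategy pairs (should be <= 0)."""
    rng = random.Random(seed)

    def rand_simplex():
        v = [rng.random() + 1e-9 for _ in range(d)]
        z = sum(v)
        return [a / z for a in v]

    worst = float("-inf")
    for _ in range(trials):
        table = {s: rng.random() for s in itertools.product(range(d), repeat=n)}
        u0 = lambda x, s_minus: table[(x,) + s_minus]
        prev = [rand_simplex() for _ in range(n - 1)]
        curr = [rand_simplex() for _ in range(n - 1)]
        lhs = max(abs(a - b) for a, b in zip(utility_vector(u0, curr, d),
                                              utility_vector(u0, prev, d)))
        rhs = sum(sum(abs(a - b) for a, b in zip(c, p)) for c, p in zip(curr, prev))
        worst = max(worst, lhs - rhs)
    return worst
```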


Remark: The rates from the theorem depend on $\alpha$, which will be $O(1)$ in the sequel. The above theorem extends to the case where $\|\cdot\|$ is any norm equivalent to the $\ell_1$ norm. The resulting requirement on $\beta$ in terms of $\gamma$ can however be more stringent. Also, unlike previous results [?], the theorem does not require that all players use the same no-regret algorithm, as long as each player’s algorithm satisfies the RVU property with a common bound on the constants.

We now instantiate the result with examples that satisfy the RVU property with different constants. 


3.1.1 Optimistic Mirror Descent 


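The optimistic mirror descent (OMD) algorithm of Rakhlin and Sridharan [?] is parameterized by an adaptive predictor sequence $\mathbf{M}_i^t$ and a regularizer² $\mathcal{R}$ which is 1-strongly convex³ with respect to a norm $\|\cdot\|$. Let $D_{\mathcal{R}}$ denote the Bregman divergence associated with $\mathcal{R}$. Then the update rule is defined as follows: let $\mathbf{g}_i^0 = \arg\min_{\mathbf{g}\in\Delta(S_i)}\mathcal{R}(\mathbf{g})$ and

$$\Phi(\mathbf{u}, \mathbf{g}) = \arg\max_{\mathbf{w}\in\Delta(S_i)}\ \eta\cdot\langle \mathbf{w}, \mathbf{u}\rangle - D_{\mathcal{R}}(\mathbf{w}, \mathbf{g}),$$

then:

$$\mathbf{w}_i^t = \Phi(\mathbf{M}_i^t, \mathbf{g}_i^{t-1}), \quad\text{and}\quad \mathbf{g}_i^t = \Phi(\mathbf{u}_i^t, \mathbf{g}_i^{t-1}).$$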
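One concrete instance, chosen here for illustration rather than prescribed by the text, is the negative-entropy regularizer $\mathcal{R}(\mathbf{w}) = \sum_x w_x \log w_x$, which is 1-strongly convex with respect to $\ell_1$ on the simplex (Pinsker’s inequality). Its Bregman divergence is the KL divergence, and $\Phi(\mathbf{u}, \mathbf{g})$ has the closed form $w_x \propto g_x e^{\eta u_x}$. A minimal sketch of OMD with $\mathbf{M}_i^t = \mathbf{u}_i^{t-1}$, assuming $\mathbf{u}_i^0 = \mathbf{0}$; class and function names are hypothetical:

```python
import math


def entropic_prox(u, g, eta):
    """Phi(u, g) = argmax_w eta*<w, u> - KL(w || g) over the simplex.
    Closed form: w_x proportional to g_x * exp(eta * u_x)."""
    m = max(u)  # shift for numerical stability; cancels after normalisation
    z = [gx * math.exp(eta * (ux - m)) for gx, ux in zip(g, u)]
    s = sum(z)
    return [v / s for v in z]


class OptimisticMD:
    """Optimistic mirror descent with predictor M^t = u^{t-1}.

    Assumptions: negative-entropy regularizer, and u^0 = 0 so that the
    first play is g^0, the uniform distribution.
    """

    def __init__(self, d, eta):
        self.eta = eta
        self.g = [1.0 / d] * d      # g^0 = argmin of negative entropy
        self.m = [0.0] * d          # predictor M^t = u^{t-1}

    def play(self):
        return entropic_prox(self.m, self.g, self.eta)   # w^t = Phi(M^t, g^{t-1})

    def update(self, u):
        self.g = entropic_prox(u, self.g, self.eta)       # g^t = Phi(u^t, g^{t-1})
        self.m = list(u)                                   # next prediction M^{t+1}
```

Note that the last observed utility enters twice: once through $\mathbf{g}^{t-1}$ and once through $\mathbf{M}^t$.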


Then the following proposition can be obtained for this method.

Proposition 5. The OMD algorithm using stepsize $\eta$ and $\mathbf{M}_i^t = \mathbf{u}_i^{t-1}$ satisfies the RVU property with constants $\alpha = R/\eta$, $\beta = \eta$, $\gamma = 1/(8\eta)$, where $R = \max_i \sup_{\mathbf{f}} D_{\mathcal{R}}(\mathbf{f}, \mathbf{g}_i^0)$.

The proposition follows by further crystallizing the arguments of Rakhlin and Sridharan [?], and we provide a proof in the appendix for completeness. The above proposition, along with Theorem 4, immediately yields the following corollary, which had been proved by Rakhlin and Sridharan [?] for two-person zero-sum games, and which we here extend to general games.

Corollary 6. If each player runs OMD with $\mathbf{M}_i^t = \mathbf{u}_i^{t-1}$ and stepsize $\eta = 1/(\sqrt{8}(n-1))$, then we have $\sum_{i\in N} r_i(T) \le nR/\eta \le n(n-1)\sqrt{8}R = O(1)$.

The corollary follows by noting that the condition $\beta \le \gamma/(n-1)^2$ is met with our choice of $\eta$.

²Here and in the sequel, we can use a different regularizer $\mathcal{R}_i$ for each player $i$, without qualitatively affecting any of the results.

³$\mathcal{R}$ is 1-strongly convex if $\mathcal{R}\left(\frac{\mathbf{u}+\mathbf{v}}{2}\right) \le \frac{\mathcal{R}(\mathbf{u}) + \mathcal{R}(\mathbf{v})}{2} - \frac{\|\mathbf{u}-\mathbf{v}\|^2}{8}$.

3.1.2 Optimistic Follow the Regularized Leader


We next consider a different class of algorithms denoted as optimistic follow the regularized leader (OFTRL). This algorithm is similar but not equivalent to OMD, and is an analogous extension of standard FTRL [?]. This algorithm takes the same parameters as for OMD and is defined as follows: Let $\mathbf{w}_i^0 = \arg\min_{\mathbf{w}\in\Delta(S_i)}\mathcal{R}(\mathbf{w})$ and:

$$\mathbf{w}_i^T = \arg\max_{\mathbf{w}\in\Delta(S_i)}\left\langle \mathbf{w}, \sum_{t=1}^{T-1}\mathbf{u}_i^t + \mathbf{M}_i^T\right\rangle - \frac{\mathcal{R}(\mathbf{w})}{\eta}.$$
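With the negative-entropy regularizer (again our illustrative choice), the OFTRL argmax is a softmax: $\mathbf{w}_i^T \propto \exp\big(\eta(\sum_{t<T}\mathbf{u}_i^t + \mathbf{M}_i^T)\big)$. The sketch below uses the one-step predictor $\mathbf{M}_i^T = \mathbf{u}_i^{T-1}$, which is the Optimistic Hedge variant evaluated in Section 5; setting the prediction to the zero vector instead recovers vanilla Hedge. The class name and the convention $\mathbf{M}_i^1 = \mathbf{0}$ are assumptions of this sketch.

```python
import math


def softmax(v, eta):
    """argmax_w <w, v> - R(w)/eta over the simplex for negative entropy R."""
    m = max(v)
    z = [math.exp(eta * (x - m)) for x in v]
    s = sum(z)
    return [x / s for x in z]


class OptimisticHedge:
    """OFTRL with negative-entropy regularizer and M^T = u^{T-1}.
    Before any observation, M^1 = 0 (assumption), so w^1 is uniform."""

    def __init__(self, d, eta):
        self.eta = eta
        self.cum = [0.0] * d     # sum_{t <= T-1} u^t
        self.last = [0.0] * d    # u^{T-1}, used as the prediction M^T

    def play(self):
        return softmax([c + m for c, m in zip(self.cum, self.last)], self.eta)

    def update(self, u):
        self.cum = [c + x for c, x in zip(self.cum, u)]
        self.last = list(u)
```

After observing $(1, 0)$ then $(0, 1)$ with $\eta = 1$, the played weights are proportional to $(e^{1}, e^{2})$, i.e. the last utility is counted twice.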

We consider three variants of OFTRL with different choices of the sequence M*, incorporating the 
recency bias in different forms. 

One-step recency bias: The simplest form of OFTRL uses $\mathbf{M}_i^t = \mathbf{u}_i^{t-1}$ and obtains the following result, where $R = \max_i\left(\sup_{\mathbf{f}\in\Delta(S_i)}\mathcal{R}(\mathbf{f}) - \inf_{\mathbf{f}\in\Delta(S_i)}\mathcal{R}(\mathbf{f})\right)$.

Proposition 7. The OFTRL algorithm using stepsize $\eta$ and $\mathbf{M}_i^t = \mathbf{u}_i^{t-1}$ satisfies the RVU property with constants $\alpha = R/\eta$, $\beta = \eta$ and $\gamma = 1/(4\eta)$.

Combined with Theorem 4, this yields the following constant bound on the total regret of all players:

Corollary 8. If each player runs OFTRL with $\mathbf{M}_i^t = \mathbf{u}_i^{t-1}$ and $\eta = 1/(2(n-1))$, then we have $\sum_{i\in N} r_i(T) \le nR/\eta \le 2n(n-1)R = O(1)$.

Rakhlin and Sridharan [?] also analyze an FTRL variant, but require a self-concordant barrier for the constraint set as opposed to an arbitrary strongly convex regularizer, and their bound is missing the crucial negative terms of the RVU property which are essential for obtaining Theorem 4.

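$H$-step recency bias: More generally, given a window size $H$, one can define $\mathbf{M}_i^t = \frac{1}{H}\sum_{\tau=t-H}^{t-1}\mathbf{u}_i^\tau$. We have the following proposition.

Proposition 9. The OFTRL algorithm using stepsize $\eta$ and $\mathbf{M}_i^t = \frac{1}{H}\sum_{\tau=t-H}^{t-1}\mathbf{u}_i^\tau$ satisfies the RVU property with constants $\alpha = R/\eta$, $\beta = \eta H^2$ and $\gamma = 1/(4\eta)$.

Setting $\eta = 1/(2H(n-1))$, we obtain the analogue of Corollary 8 with an extra factor of $H$.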
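The stepsizes in Corollaries 6 and 8 and in the $H$-step analogue all come from solving Theorem 4’s condition $\beta \le \gamma/(n-1)^2$ for $\eta$, after which the total regret bound is $n\alpha = nR/\eta$. A small arithmetic sketch of that calculation (function names are ours):

```python
import math


def omd_eta(n):
    """Corollary 6: beta = eta, gamma = 1/(8 eta); largest eta meeting the condition."""
    return 1.0 / (math.sqrt(8) * (n - 1))


def oftrl_eta(n, H=1):
    """Corollary 8 (H = 1) and the H-step analogue: beta = eta*H^2, gamma = 1/(4 eta)."""
    return 1.0 / (2 * H * (n - 1))


def condition_holds(beta, gamma, n, tol=1e-12):
    """Theorem 4's requirement beta <= gamma / (n-1)^2, with a float tolerance."""
    return beta <= gamma / (n - 1) ** 2 * (1 + tol)


def total_regret_bound(n, R, eta):
    """Theorem 4 with alpha = R / eta: sum_i r_i(T) <= n * R / eta."""
    return n * R / eta
```

At these stepsizes the condition holds with equality, so any larger $\eta$ violates it.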

Geometrically discounted recency bias: The next proposition considers an alternative form of recency bias which includes all the previous utilities, but with a geometric discounting.

Proposition 10. The OFTRL algorithm using stepsize $\eta$ and $\mathbf{M}_i^t = \frac{1}{\sum_{\tau=0}^{t-1}\delta^{-\tau}}\sum_{\tau=0}^{t-1}\delta^{-\tau}\mathbf{u}_i^\tau$ satisfies the RVU property with constants $\alpha = R/\eta$, $\beta = \eta/(1-\delta)^3$ and $\gamma = 1/(8\eta)$.

Note that these choices for $\mathbf{M}_i^t$ can also be used in OMD with qualitatively similar results.
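The three predictors above can be written as plain functions of the observed history $[\mathbf{u}^1, \ldots, \mathbf{u}^{t-1}]$. A minimal sketch; the function names are hypothetical, and two boundary conventions are our assumptions rather than the text’s: $\mathbf{u}^\tau = \mathbf{0}$ for $\tau \le 0$ in the window, and $\mathbf{u}^0 = \mathbf{0}$ in the discounted average (so it only affects the normaliser). For the discounted version, multiplying numerator and denominator by $\delta^{t-1}$ turns the weights $\delta^{-\tau}$ into $\delta^{t-1-\tau}$, which avoids overflow.

```python
def last_step(history):
    """One-step recency bias: M^t = u^{t-1}."""
    return list(history[-1])


def window(H):
    """H-step recency bias: M^t = (1/H) * sum_{tau=t-H}^{t-1} u^tau.
    Assumption: u^tau = 0 for tau <= 0, so we always divide by H."""
    def predict(history):
        d = len(history[-1])
        recent = history[-H:]
        return [sum(u[x] for u in recent) / H for x in range(d)]
    return predict


def geometric(delta):
    """Discounted average with weights delta^{-tau}, tau = 0..t-1, rescaled to
    delta^{t-1-tau}. Assumption: u^0 = 0, so it only enters the normaliser."""
    def predict(history):           # history = [u^1, ..., u^{t-1}]
        t_minus_1 = len(history)
        d = len(history[-1])
        norm = sum(delta ** (t_minus_1 - tau) for tau in range(t_minus_1 + 1))
        num = [0.0] * d
        for tau, u in enumerate(history, start=1):
            w = delta ** (t_minus_1 - tau)
            num = [a + w * b for a, b in zip(num, u)]
        return [a / norm for a in num]
    return predict
```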

3.2 Fast Convergence of Individual Utilities 

The previous section shows implications of the RVU property on the social welfare. This section 
complements these with a similar result for each player’s individual utility. 

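Theorem 11. Suppose that the players use algorithms satisfying the RVU property with parameters $\alpha > 0$, $\beta > 0$, $\gamma \ge 0$. If we further have the stability property $\|\mathbf{w}_i^t - \mathbf{w}_i^{t+1}\| \le \kappa$, then for any player $\sum_{t=1}^{T}\langle \mathbf{w}_i^* - \mathbf{w}_i^t, \mathbf{u}_i^t\rangle \le \alpha + \beta\kappa^2(n-1)^2 T$.

Proof. Similar reasoning as in Theorem 4 yields $\|\mathbf{u}_i^t - \mathbf{u}_i^{t-1}\|_*^2 \le (n-1)\sum_{j\ne i}\|\mathbf{w}_j^t - \mathbf{w}_j^{t-1}\|^2 \le \kappa^2(n-1)^2$, and summing the terms gives the theorem. ■

Noting that OFTRL satisfies the RVU property with constants given in Proposition 7 and the stability property with $\kappa = 2\eta$ (see Lemma 20 in the appendix), we have the following corollary.

Corollary 12. If all players use the OFTRL algorithm with $\mathbf{M}_i^t = \mathbf{u}_i^{t-1}$ and $\eta = (n-1)^{-1/2}T^{-1/4}$, then we have $\sum_{t=1}^{T}\langle \mathbf{w}_i^* - \mathbf{w}_i^t, \mathbf{u}_i^t\rangle \le (R+4)\sqrt{n-1}\cdot T^{1/4}$.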
Similar results hold for the other forms of recency bias, as well as for OMD. Corollary 12 gives a fast convergence rate of the players’ strategies to the set of coarse correlated equilibria (CCE) of the game. This improves the previously known $O(1/\sqrt{T})$ convergence rate (e.g. [?]) to CCE using natural, decoupled no-regret dynamics defined in [?].
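Corollary 12 is a direct substitution into Theorem 11 with the OFTRL constants $\alpha = R/\eta$, $\beta = \eta$, $\kappa = 2\eta$. A short arithmetic sketch (hypothetical function names) that reproduces the $(R+4)\sqrt{n-1}\,T^{1/4}$ bound, i.e. average regret $O(T^{-3/4})$:

```python
def theorem11_bound(alpha, beta, kappa, n, T):
    """Theorem 11: individual regret <= alpha + beta * kappa^2 * (n-1)^2 * T."""
    return alpha + beta * kappa ** 2 * (n - 1) ** 2 * T


def corollary12_bound(R, n, T):
    """OFTRL constants alpha = R/eta, beta = eta, kappa = 2*eta,
    with eta = (n-1)^(-1/2) * T^(-1/4) as in Corollary 12."""
    eta = (n - 1) ** -0.5 * T ** -0.25
    return theorem11_bound(R / eta, eta, 2 * eta, n, T)
```

For instance, with $R = 1$, $n = 5$ and $T = 10^4$ the two terms are $20$ and $80$, for a total of $100 = (1+4)\cdot 2\cdot 10$.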

4 Robustness to Adversarial Opponent 

So far we have shown simple dynamics with rapid convergence properties in favorable environments when each player in the game uses an algorithm with the RVU property. It is natural to wonder if this comes at the cost of worst-case guarantees when some players do not use algorithms with this property. Rakhlin and Sridharan [?] address this concern by modifying the OMD algorithm with additional smoothing and adaptive step-sizes so as to preserve the fast rates in the favorable case while still guaranteeing $O(1/\sqrt{T})$ regret for each player, no matter how the opponents play. It is not so obvious how this modification might extend to other procedures, and it seems undesirable to abandon the black-box regret transformations we used to obtain Theorem 4. In this section, we present a generic way of transforming an algorithm which satisfies the RVU property so that it retains the fast convergence in favorable settings, but always guarantees a worst-case regret of $O(1/\sqrt{T})$.

In order to present our modification, we need a parametric form of the RVU property which will also involve a tunable parameter of the algorithm. For most online learning algorithms, this will correspond to the step-size parameter used by the algorithm.

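Definition 13 (RVU($\rho$) property). We say that a parametric algorithm $\mathcal{A}(\rho)$ satisfies the Regret bounded by Variation in Utilities($\rho$) (RVU($\rho$)) property with parameters $\alpha, \beta, \gamma > 0$ and a pair of dual norms $(\|\cdot\|, \|\cdot\|_*)$ if its regret on any sequence of utilities $\mathbf{u}^1, \mathbf{u}^2, \ldots, \mathbf{u}^T$ is bounded as

$$\sum_{t=1}^{T}\langle \mathbf{w}^* - \mathbf{w}^t, \mathbf{u}^t\rangle \le \frac{\alpha}{\rho} + \rho\beta\sum_{t=1}^{T}\|\mathbf{u}^t - \mathbf{u}^{t-1}\|_*^2 - \frac{\gamma}{\rho}\sum_{t=1}^{T}\|\mathbf{w}^t - \mathbf{w}^{t-1}\|^2. \tag{3}$$

In both OMD and OFTRL algorithms from Section 3, the parameter $\rho$ is precisely the stepsize $\eta$. We now show an adaptive choice of $\rho$ according to an epoch-based doubling schedule.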


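Black-box reduction. Given a parametric algorithm $\mathcal{A}(\rho)$ as a black-box, we construct a wrapper $\mathcal{A}'$ based on the doubling trick: the algorithm of each player proceeds in epochs. At each epoch $r$ the player $i$ has an upper bound of $B_r$ on the quantity $\sum_{\tau=1}^{t}\|\mathbf{u}_i^\tau - \mathbf{u}_i^{\tau-1}\|_*^2$. We start with a parameter $\eta^*$ and $B_1 = 1$, and for $t = 1, 2, \ldots, T$ repeat:

1. Play according to $\mathcal{A}(\eta_r)$ and receive $\mathbf{u}_i^t$.
2. If $\sum_{\tau=1}^{t}\|\mathbf{u}_i^\tau - \mathbf{u}_i^{\tau-1}\|_*^2 \ge B_r$:
   (a) Update $r \leftarrow r + 1$, $B_r \leftarrow 2B_{r-1}$, $\eta_r = \min\{\alpha/\sqrt{B_r}, \eta^*\}$, with $\alpha$ as in Equation (3).
   (b) Start a new run of $\mathcal{A}$ with parameter $\eta_r$.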
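The wrapper can be sketched around any learner object. In the sketch below, `make_alg(eta)` is a hypothetical factory returning a fresh learner with `play()` and `update(u)` methods; variation is measured in $\ell_\infty$ (the dual of $\ell_1$); $\mathbf{u}^0 = \mathbf{0}$ is assumed; and the first epoch’s parameter is assumed to follow the same rule $\min\{\alpha/\sqrt{B_1}, \eta^*\}$, which the steps above leave implicit. A trivial uniform learner is included only to exercise the epoch logic.

```python
import math


class DoublingWrapper:
    """Sketch of the epoch-based wrapper A' around a black-box A(eta)."""

    def __init__(self, make_alg, alpha, eta_star):
        self.make_alg, self.alpha, self.eta_star = make_alg, alpha, eta_star
        self.B = 1.0                                          # B_1 = 1
        self.eta = min(alpha / math.sqrt(self.B), eta_star)   # assumed rule for eta_1
        self.alg = make_alg(self.eta)
        self.prev_u = None
        self.variation = 0.0    # sum_tau ||u^tau - u^{tau-1}||_inf^2 over the whole run
        self.epochs = 1

    def play(self):
        return self.alg.play()

    def update(self, u):
        self.alg.update(u)
        prev = self.prev_u if self.prev_u is not None else [0.0] * len(u)
        self.variation += max(abs(a - b) for a, b in zip(u, prev)) ** 2
        self.prev_u = list(u)
        if self.variation >= self.B:                          # bound B_r reached: new epoch
            self.epochs += 1
            self.B *= 2
            self.eta = min(self.alpha / math.sqrt(self.B), self.eta_star)
            self.alg = self.make_alg(self.eta)                # restart A with new parameter


class UniformLearner:
    """Trivial stand-in for A(eta), used only to exercise the wrapper."""

    def __init__(self, eta, d=2):
        self.eta, self.d = eta, d

    def play(self):
        return [1.0 / self.d] * self.d

    def update(self, u):
        pass
```

With alternating utilities $(1,0), (0,1), \ldots$ each step adds 1 to the variation, so the bound doubles after steps 1, 2 and 4.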

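Theorem 14. Algorithm $\mathcal{A}'$ achieves regret at most the minimum of the following two terms:

$$\sum_{t=1}^{T}\langle \mathbf{w}_i^* - \mathbf{w}_i^t, \mathbf{u}_i^t\rangle \le \log(T)\left(2 + \frac{\alpha}{\eta^*} + (2 + \eta^*\cdot\beta)\sum_{t=1}^{T}\|\mathbf{u}_i^t - \mathbf{u}_i^{t-1}\|_*^2\right) - \frac{\gamma}{\eta^*}\sum_{t=1}^{T}\|\mathbf{w}_i^t - \mathbf{w}_i^{t-1}\|^2 \tag{4}$$

$$\sum_{t=1}^{T}\langle \mathbf{w}_i^* - \mathbf{w}_i^t, \mathbf{u}_i^t\rangle \le \log(T)\left(1 + \frac{\alpha}{\eta^*} + (1 + \alpha\cdot\beta)\sqrt{2\sum_{t=1}^{T}\|\mathbf{u}_i^t - \mathbf{u}_i^{t-1}\|_*^2}\right) \tag{5}$$

That is, the algorithm satisfies the RVU property, and also has regret that can never exceed $\tilde{O}(\sqrt{T})$. The theorem thus yields the following corollary, which illustrates the stated robustness of $\mathcal{A}'$.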

Corollary 15. Algorithm $\mathcal{A}'$, with $\eta^* = \frac{\gamma}{(2+\beta)(n-1)^2\log(T)}$, achieves regret $\tilde{O}(\sqrt{T})$ against any adversarial sequence, while at the same time satisfying the conditions of Theorem 4. Thereby, if all players use such an algorithm, then: $\sum_{i\in N} r_i(T) \le n\log(T)\left(\alpha/\eta^* + 2\right) = \tilde{O}(1)$.
[Figure 1 panels: Sum of regrets | Max of regrets]

Figure 1: Maximum and sum of individual regrets over time under the Hedge (blue) and Optimistic Hedge (red) dynamics.


Proof. Observe that for such $\eta^*$, provided $\eta^* \le 1$, we have that $(2 + \eta^*\cdot\beta)\log(T) \le (2+\beta)\log(T) \le \frac{\gamma}{(n-1)^2\eta^*}$. Therefore, algorithm $\mathcal{A}'$ satisfies the sufficient conditions of Theorem 4. ■


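If $\mathcal{A}(\rho)$ is the OFTRL algorithm, then we know by Proposition 7 that the above result applies with $\alpha = R = \max_{\mathbf{w}}\mathcal{R}(\mathbf{w})$, $\beta = 1$, $\gamma = \frac{1}{4}$ and $\rho = \eta$. Setting $\eta^* = \frac{\gamma}{(2+\beta)(n-1)^2\log(T)} = \frac{1}{12(n-1)^2\log(T)}$, the resulting algorithm $\mathcal{A}'$ will have regret at most $\tilde{O}(n^2\sqrt{T})$ against an arbitrary adversary, while if all players use algorithm $\mathcal{A}'$ then $\sum_{i\in N} r_i(T) \le n\log(T)\left(2 + 12R(n-1)^2\log(T)\right) = O(n^3\log^2(T))$.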
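The OFTRL instantiation of Corollary 15 is again plain arithmetic. The sketch below (hypothetical names) computes $\eta^*$, checks the sufficient condition from the proof of Corollary 15, and evaluates the all-players bound $n\log(T)(\alpha/\eta^* + 2)$ with $\alpha = R$, $\beta = 1$, $\gamma = 1/4$; natural logarithms are assumed.

```python
import math


def eta_star(gamma, beta, n, T):
    """Corollary 15: eta* = gamma / ((2 + beta) (n-1)^2 log T)."""
    return gamma / ((2 + beta) * (n - 1) ** 2 * math.log(T))


def condition_holds(gamma, beta, n, T, e):
    """Sufficient condition from the proof: (2 + eta* beta) log T <= gamma / ((n-1)^2 eta*)."""
    return (2 + e * beta) * math.log(T) <= gamma / ((n - 1) ** 2 * e)


def all_players_bound(alpha, e, n, T):
    """Corollary 15: sum_i r_i(T) <= n log(T) (alpha / eta* + 2)."""
    return n * math.log(T) * (alpha / e + 2)
```

With $n = 3$ and $\log T = 2$ this gives $\eta^* = 1/96$ and, for $R = 1$, a total-regret bound of $3\cdot 2\cdot(96 + 2) = 588$.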


An analogue of Theorem 11 can also be established for this algorithm:

Corollary 16. If $\mathcal{A}$ satisfies the RVU($\rho$) property, and also $\|\mathbf{w}_i^t - \mathbf{w}_i^{t-1}\| \le \kappa\rho$, then $\mathcal{A}'$ with $\eta^* = T^{-1/4}$ achieves regret $\tilde{O}(T^{1/4})$ if played against itself and $\tilde{O}(\sqrt{T})$ against any opponent.

Once again, OFTRL satisfies the above conditions with $\kappa = 2$, implying robust convergence.


5 Experimental Evaluation 

We analyzed the performance of optimistic follow the regularized leader with the entropy regularizer, which corresponds to the Hedge algorithm [?] modified so that the last iteration’s utility for each strategy is double counted; we refer to it as Optimistic Hedge. More formally, the probability of player $i$ playing strategy $j$ at iteration $T$ is proportional to $\exp\left(\eta\cdot\left(\sum_{t=1}^{T-2}u_{ij}^t + 2u_{ij}^{T-1}\right)\right)$, rather than $\exp\left(\eta\cdot\sum_{t=1}^{T-1}u_{ij}^t\right)$ as is standard for Hedge.

We studied a simple auction where $n$ players are bidding for $m$ items. Each player has a value $v$ for getting at least one item and no extra value for more items. The utility of a player is the value for the allocation he obtains minus the payment he has to make. The game is defined as follows: simultaneously, each player picks one of the $m$ items and submits a bid on that item (we assume bids to be discretized). For each item, the highest bidder wins and pays his bid. We let players play this game repeatedly, with each player invoking either Hedge or Optimistic Hedge. This game, and generalizations of it, are known to be $(1 - 1/e, 0)$-smooth [?], if we also view the auctioneer as a player whose utility is the revenue. The welfare of the game is the value of the resulting allocation, hence the game is not constant-sum. The welfare maximization problem corresponds to the unweighted bipartite matching problem. The PoA captures how far the average allocation of the dynamics is from the optimal matching. By smoothness, we know that the average welfare converges to at least a $1 - 1/e$ fraction of the optimum.
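The stage game’s payoffs can be sketched as follows. Tie-breaking is not specified in the text; this sketch assumes ties are broken uniformly at random and returns expected utilities. Welfare counts the auctioneer’s revenue, so payments cancel and each item receiving at least one bid contributes $v$ (each player bids on a single item, so winners are distinct). Function names are illustrative.

```python
from collections import defaultdict


def auction_utilities(bids, v):
    """Expected utilities in the simultaneous first-price item auction.

    bids[i] = (item, bid) for player i. Each player values winning at least
    one item at v. Assumption: ties are broken uniformly at random, so tied
    highest bidders each get (v - bid) / (number tied) in expectation.
    """
    bidders = defaultdict(list)
    for i, (item, _) in enumerate(bids):
        bidders[item].append(i)
    util = [0.0] * len(bids)
    for players in bidders.values():
        top = max(bids[i][1] for i in players)
        winners = [i for i in players if bids[i][1] == top]
        for i in winners:
            util[i] = (v - top) / len(winners)
    return util


def welfare(bids, v):
    """Players' utilities plus auctioneer revenue: payments cancel, leaving v
    for every item that receives at least one bid."""
    return v * len({item for item, _ in bids})
```

For example, with bids $(0,5), (0,7), (1,3), (2,3)$ and $v = 20$, the utilities are $0, 13, 17, 17$, the revenue is $13$, and the welfare is $60$.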

Fast convergence of individual and average regret. We run the game for $n = 4$ bidders, $m = 4$ items and valuation $v = 20$. The bids are discretized to be any integer in $[1, 20]$. We find that the sum of the regrets and the maximum individual regret across players are remarkably lower under Optimistic Hedge than under Hedge. In Figure 1 we plot the maximum individual regret as well as the sum of the regrets under the two algorithms, using $\eta = 0.1$ for both methods. Thus convergence to the set of coarse correlated equilibria is substantially faster under Optimistic Hedge, confirming our results in Section 3.2. We also observe similar behavior when each player only has value on a randomly picked player-specific subset of items, or uses other step sizes.

Figure 2: Expected bid and per-iteration utility of a player on one of the four items over time, under Hedge (blue) and Optimistic Hedge (red) dynamics.


More stable dynamics. We observe that the behavior under Optimistic Hedge is more stable than under Hedge. In Figure 2 we plot the expected bid of a player on one of the items and his expected utility under the two dynamics. Hedge exhibits the sawtooth behavior that was observed in the generalized first price auction run by Overture (see [?, p. 21]). In stunning contrast, Optimistic Hedge leads to more stable expected bids over time. This stability property of Optimistic Hedge is one of the main intuitive reasons for the fast convergence of its regret.


Welfare. In this class of games, we did not observe any significant difference between the average welfare of the methods. The key reason is the following: the proof that no-regret dynamics are approximately efficient (Proposition 2) only relies on the fact that each player does not have regret against the strategy $s_i^*$ used in the definition of a smooth game. In this game, regret against these strategies is experimentally comparable under both algorithms, even though regret against the best fixed strategy is remarkably different. This indicates a possibility for faster rates for Hedge in terms of welfare. In Appendix H we show fast convergence of the efficiency of Hedge for cost-minimization games, though with a worse PoA.


6 Discussion 


This work extends and generalizes a growing body of work on decentralized no-regret dynamics in many ways. We demonstrate a class of no-regret algorithms which enjoy rapid convergence when played against each other, while being robust to adversarial opponents. This has implications for the computation of correlated equilibria, as well as for understanding the behavior of agents in complex multi-player games. There are a number of interesting questions and directions for future research which are suggested by our results, including the following:

Convergence rates for vanilla Hedge: The fast rates of our paper do not apply to algorithms such as Hedge without modification. Is this modification to satisfy RVU only sufficient, or also necessary? If vanilla Hedge does not achieve the fast rates, are there counterexamples? In the supplement, we include a sketch hinting at such a counterexample, but also showing fast rates to a worse equilibrium than our optimistic algorithms.

Convergence of players’ strategies: The OFTRL algorithm often produces much more stable trajectories empirically, as the players converge to an equilibrium, as opposed to, say, Hedge. A precise quantification of this desirable behavior would be of great interest.

Better rates with partial information: If the players do not observe the expected utility function, 
but only the moves of the other players at each round, can we still obtain faster rates? 
Supplementary material for
“Fast Convergence of Regularized Learning in Games”

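A Proof of Proposition 2


Proposition 2. In a $(\lambda, \mu)$-smooth game, if each player $i$ suffers regret at most $r_i(T)$, then:

$$\frac{1}{T}\sum_{t=1}^{T} W(\mathbf{w}^t) \ge \frac{\lambda}{1+\mu}\,\mathrm{OPT} - \frac{1}{1+\mu}\cdot\frac{1}{T}\sum_{i\in N} r_i(T) = \frac{1}{\rho}\,\mathrm{OPT} - \frac{1}{1+\mu}\cdot\frac{1}{T}\sum_{i\in N} r_i(T),$$

where the factor $\rho = (1+\mu)/\lambda$ is called the price of total anarchy (PoA).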


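Proof. Since each player $i$ has regret at most $r_i(T)$, comparing with the fixed strategy $s_i^*$ from the definition of smoothness (viewed as a point mass $\mathbf{s}_i^* \in \Delta(S_i)$), we have that:

$$\sum_{t=1}^{T}\langle \mathbf{w}_i^t, \mathbf{u}_i^t\rangle \ge \sum_{t=1}^{T}\langle \mathbf{s}_i^*, \mathbf{u}_i^t\rangle - r_i(T).$$

Summing over all players and using the smoothness property:

$$\begin{aligned}
\sum_{t=1}^{T} W(\mathbf{w}^t) = \sum_{i\in N}\sum_{t=1}^{T}\langle \mathbf{w}_i^t, \mathbf{u}_i^t\rangle
&\ge \sum_{t=1}^{T}\mathbb{E}_{\mathbf{s}\sim\mathbf{w}^t}\Big[\sum_{i\in N} u_i(s_i^*, \mathbf{s}_{-i})\Big] - \sum_{i\in N} r_i(T)\\
&\ge \sum_{t=1}^{T}\left(\lambda\,\mathrm{OPT} - \mu\,\mathbb{E}_{\mathbf{s}\sim\mathbf{w}^t}[W(\mathbf{s})]\right) - \sum_{i\in N} r_i(T)\\
&= \sum_{t=1}^{T}\left(\lambda\,\mathrm{OPT} - \mu\, W(\mathbf{w}^t)\right) - \sum_{i\in N} r_i(T).
\end{aligned} \tag{6}$$

By re-arranging we get the result. ■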


B Proof of Proposition 5


Proposition 5. The OMD algorithm using stepsize $\eta$ and $\mathbf{M}_i^t = \mathbf{u}_i^{t-1}$ satisfies the RVU property with constants $\alpha = R/\eta$, $\beta = \eta$, $\gamma = 1/(8\eta)$, where $R = \max_i \sup_{\mathbf{f}} D_{\mathcal{R}}(\mathbf{f}, \mathbf{g}_i^0)$.

We will use the following theorem of [?]:

Theorem 17 (Rakhlin and Sridharan [?]). The regret of a player under optimistic mirror descent and with respect to any $\mathbf{w}^* \in \Delta(S_i)$ is upper bounded by:

$$\sum_{t=1}^{T}\langle \mathbf{w}^* - \mathbf{w}^t, \mathbf{u}^t\rangle \le \frac{R}{\eta} + \sum_{t=1}^{T}\|\mathbf{u}^t - \mathbf{M}^t\|_*\|\mathbf{w}^t - \mathbf{g}^t\| - \frac{1}{2\eta}\sum_{t=1}^{T}\left(\|\mathbf{w}^t - \mathbf{g}^t\|^2 + \|\mathbf{w}^t - \mathbf{g}^{t-1}\|^2\right) \tag{7}$$

where $R = \sup_{\mathbf{f}} D_{\mathcal{R}}(\mathbf{f}, \mathbf{g}^0)$.

We show that if the players use optimistic mirror descent with $\mathbf{M}_i^t = \mathbf{u}_i^{t-1}$, then the regret of each player satisfies the sufficient condition presented in the previous section. Some of the key facts (Equations (9) and (10)) that we use in the following proof appear in [?]. However, the formulation of the regret that we present in the following theorem is not immediately clear in their proof, so we present it here for clarity and completeness.


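Theorem 18. The regret of a player under optimistic mirror descent with $\mathbf{M}^t = \mathbf{u}^{t-1}$ and with respect to any $\mathbf{w}^* \in \Delta(S_i)$ is upper bounded by:

$$\sum_{t=1}^{T}\langle \mathbf{w}^* - \mathbf{w}^t, \mathbf{u}^t\rangle \le \frac{R}{\eta} + \eta\sum_{t=1}^{T}\|\mathbf{u}^t - \mathbf{u}^{t-1}\|_*^2 - \frac{1}{8\eta}\sum_{t=1}^{T}\|\mathbf{w}^t - \mathbf{w}^{t-1}\|^2 \tag{8}$$

Proof. By Theorem 17 instantiated for $\mathbf{M}^t = \mathbf{u}^{t-1}$, we get:

$$\sum_{t=1}^{T}\langle \mathbf{w}^* - \mathbf{w}^t, \mathbf{u}^t\rangle \le \frac{R}{\eta} + \sum_{t=1}^{T}\|\mathbf{u}^t - \mathbf{u}^{t-1}\|_*\|\mathbf{w}^t - \mathbf{g}^t\| - \frac{1}{2\eta}\sum_{t=1}^{T}\left(\|\mathbf{w}^t - \mathbf{g}^t\|^2 + \|\mathbf{w}^t - \mathbf{g}^{t-1}\|^2\right)$$

Using the fact that for any $\rho > 0$:

$$\|\mathbf{u}^t - \mathbf{M}^t\|_*\|\mathbf{w}^t - \mathbf{g}^t\| \le \frac{\rho}{2}\|\mathbf{u}^t - \mathbf{M}^t\|_*^2 + \frac{1}{2\rho}\|\mathbf{w}^t - \mathbf{g}^t\|^2,$$

we get:

$$\sum_{t=1}^{T}\langle \mathbf{w}^* - \mathbf{w}^t, \mathbf{u}^t\rangle \le \frac{R}{\eta} + \frac{\rho}{2}\sum_{t=1}^{T}\|\mathbf{u}^t - \mathbf{u}^{t-1}\|_*^2 + \frac{1}{2\rho}\sum_{t=1}^{T}\|\mathbf{w}^t - \mathbf{g}^t\|^2 - \frac{1}{2\eta}\sum_{t=1}^{T}\left(\|\mathbf{w}^t - \mathbf{g}^t\|^2 + \|\mathbf{w}^t - \mathbf{g}^{t-1}\|^2\right)$$

For $\rho = 2\eta$, the latter simplifies to:

$$\sum_{t=1}^{T}\langle \mathbf{w}^* - \mathbf{w}^t, \mathbf{u}^t\rangle \le \frac{R}{\eta} + \eta\sum_{t=1}^{T}\|\mathbf{u}^t - \mathbf{u}^{t-1}\|_*^2 - \frac{1}{4\eta}\sum_{t=1}^{T}\|\mathbf{w}^t - \mathbf{g}^t\|^2 - \frac{1}{2\eta}\sum_{t=1}^{T}\|\mathbf{w}^t - \mathbf{g}^{t-1}\|^2 \tag{9}$$

Last we use the fact that:

$$\|\mathbf{w}^t - \mathbf{w}^{t-1}\|^2 \le 2\|\mathbf{w}^t - \mathbf{g}^{t-1}\|^2 + 2\|\mathbf{w}^{t-1} - \mathbf{g}^{t-1}\|^2$$

Summing over all timesteps (taking $\mathbf{w}^0 = \mathbf{g}^0$):

$$\sum_{t=1}^{T}\|\mathbf{w}^t - \mathbf{w}^{t-1}\|^2 \le 2\sum_{t=1}^{T}\|\mathbf{w}^t - \mathbf{g}^{t-1}\|^2 + 2\sum_{t=1}^{T}\|\mathbf{w}^{t-1} - \mathbf{g}^{t-1}\|^2 \le 2\sum_{t=1}^{T}\|\mathbf{w}^t - \mathbf{g}^{t-1}\|^2 + 2\sum_{t=1}^{T}\|\mathbf{w}^t - \mathbf{g}^t\|^2 \tag{10}$$

Dividing both sides by $8\eta$ and applying it in the previous upper bound on the regret, we get:

$$\sum_{t=1}^{T}\langle \mathbf{w}^* - \mathbf{w}^t, \mathbf{u}^t\rangle \le \frac{R}{\eta} + \eta\sum_{t=1}^{T}\|\mathbf{u}^t - \mathbf{u}^{t-1}\|_*^2 - \frac{1}{8\eta}\sum_{t=1}^{T}\|\mathbf{w}^t - \mathbf{w}^{t-1}\|^2. \qquad ■$$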

C Proof of Proposition 7


Proposition 7. The OFTRL algorithm using stepsize $\eta$ and $\mathbf{M}_i^t = \mathbf{u}_i^{t-1}$ satisfies the RVU property with constants $\alpha = R/\eta$, $\beta = \eta$ and $\gamma = 1/(4\eta)$.

We first show that these algorithms achieve the same regret bounds as optimistic mirror descent. This result does not appear in previous work in any form.

Even though the algorithms do not make use of a secondary sequence, we will still use in the analysis the notation:

$$\mathbf{g}^t = \arg\max_{\mathbf{g}\in\Delta(S_i)}\left\langle \mathbf{g}, \sum_{\tau=1}^{t}\mathbf{u}^\tau\right\rangle - \frac{\mathcal{R}(\mathbf{g})}{\eta}.$$

These secondary variables are often called the leader sequence, as they can see one step into the future.

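Theorem 19. The regret of a player under optimistic FTRL and with respect to any $\mathbf{w}^* \in \Delta(S_i)$ is upper bounded by:

$$\sum_{t=1}^{T}\langle \mathbf{w}^* - \mathbf{w}^t, \mathbf{u}^t\rangle \le \frac{R}{\eta} + \sum_{t=1}^{T}\|\mathbf{u}^t - \mathbf{M}^t\|_*\|\mathbf{w}^t - \mathbf{g}^t\| - \frac{1}{2\eta}\sum_{t=1}^{T}\left(\|\mathbf{w}^t - \mathbf{g}^t\|^2 + \|\mathbf{w}^t - \mathbf{g}^{t-1}\|^2\right) \tag{11}$$

where $R = \sup_{\mathbf{f}}\mathcal{R}(\mathbf{f}) - \inf_{\mathbf{f}}\mathcal{R}(\mathbf{f})$.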


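Proof. First observe that:

$$\langle \mathbf{w}^* - \mathbf{w}^t, \mathbf{u}^t\rangle = \langle \mathbf{g}^t - \mathbf{w}^t, \mathbf{u}^t - \mathbf{M}^t\rangle + \langle \mathbf{g}^t - \mathbf{w}^t, \mathbf{M}^t\rangle + \langle \mathbf{w}^* - \mathbf{g}^t, \mathbf{u}^t\rangle \tag{12}$$

Without loss of generality we will assume that $\inf_{\mathbf{f}}\mathcal{R}(\mathbf{f}) = 0$. Since $\langle \mathbf{g}^t - \mathbf{w}^t, \mathbf{u}^t - \mathbf{M}^t\rangle \le \|\mathbf{g}^t - \mathbf{w}^t\|\,\|\mathbf{u}^t - \mathbf{M}^t\|_*$, it suffices to show that for any $\mathbf{w}^* \in \Delta(S_i)$:

$$\sum_{t=1}^{T}\left(\langle \mathbf{g}^t - \mathbf{w}^t, \mathbf{M}^t\rangle + \langle \mathbf{w}^* - \mathbf{g}^t, \mathbf{u}^t\rangle\right) \le \frac{R}{\eta} - \frac{1}{2\eta}\sum_{t=1}^{T}\left(\|\mathbf{w}^t - \mathbf{g}^t\|^2 + \|\mathbf{w}^t - \mathbf{g}^{t-1}\|^2\right) \tag{13}$$

For shorthand notation let $I_t = \frac{1}{2\eta}\left(\|\mathbf{w}^t - \mathbf{g}^t\|^2 + \|\mathbf{w}^t - \mathbf{g}^{t-1}\|^2\right)$. Since $\mathcal{R}(\mathbf{w}^*) \le R$, (13) follows from the claim that for all $\mathbf{w}^*$,

$$\sum_{t=1}^{T}\left(\langle \mathbf{g}^t - \mathbf{w}^t, \mathbf{M}^t\rangle - \langle \mathbf{g}^t, \mathbf{u}^t\rangle\right) \le -\left\langle \mathbf{w}^*, \sum_{t=1}^{T}\mathbf{u}^t\right\rangle + \frac{\mathcal{R}(\mathbf{w}^*)}{\eta} - \sum_{t=1}^{T} I_t,$$

which we prove by induction on $T$. Assume the claim holds for $T-1$. Apply it for $\mathbf{w}^* = \mathbf{g}^{T-1}$ and add $\langle \mathbf{g}^T - \mathbf{w}^T, \mathbf{M}^T\rangle - \langle \mathbf{g}^T, \mathbf{u}^T\rangle$ on both sides:

$$\begin{aligned}
\sum_{t=1}^{T}\left(\langle \mathbf{g}^t - \mathbf{w}^t, \mathbf{M}^t\rangle - \langle \mathbf{g}^t, \mathbf{u}^t\rangle\right)
&\le -\left\langle \mathbf{g}^{T-1}, \sum_{t=1}^{T-1}\mathbf{u}^t\right\rangle + \frac{\mathcal{R}(\mathbf{g}^{T-1})}{\eta} - \sum_{t=1}^{T-1} I_t + \langle \mathbf{g}^T - \mathbf{w}^T, \mathbf{M}^T\rangle - \langle \mathbf{g}^T, \mathbf{u}^T\rangle\\
&\le -\left\langle \mathbf{w}^T, \sum_{t=1}^{T-1}\mathbf{u}^t\right\rangle + \frac{\mathcal{R}(\mathbf{w}^T)}{\eta} - \frac{1}{2\eta}\|\mathbf{w}^T - \mathbf{g}^{T-1}\|^2 - \sum_{t=1}^{T-1} I_t + \langle \mathbf{g}^T - \mathbf{w}^T, \mathbf{M}^T\rangle - \langle \mathbf{g}^T, \mathbf{u}^T\rangle\\
&= -\left\langle \mathbf{w}^T, \sum_{t=1}^{T-1}\mathbf{u}^t + \mathbf{M}^T\right\rangle + \frac{\mathcal{R}(\mathbf{w}^T)}{\eta} - \frac{1}{2\eta}\|\mathbf{w}^T - \mathbf{g}^{T-1}\|^2 - \sum_{t=1}^{T-1} I_t + \langle \mathbf{g}^T, \mathbf{M}^T\rangle - \langle \mathbf{g}^T, \mathbf{u}^T\rangle\\
&\le -\left\langle \mathbf{g}^T, \sum_{t=1}^{T-1}\mathbf{u}^t + \mathbf{M}^T\right\rangle + \frac{\mathcal{R}(\mathbf{g}^T)}{\eta} - \frac{1}{2\eta}\|\mathbf{w}^T - \mathbf{g}^{T-1}\|^2 - \frac{1}{2\eta}\|\mathbf{w}^T - \mathbf{g}^T\|^2 - \sum_{t=1}^{T-1} I_t + \langle \mathbf{g}^T, \mathbf{M}^T\rangle - \langle \mathbf{g}^T, \mathbf{u}^T\rangle\\
&= -\left\langle \mathbf{g}^T, \sum_{t=1}^{T}\mathbf{u}^t\right\rangle + \frac{\mathcal{R}(\mathbf{g}^T)}{\eta} - \sum_{t=1}^{T} I_t\\
&\le -\left\langle \mathbf{q}^*, \sum_{t=1}^{T}\mathbf{u}^t\right\rangle + \frac{\mathcal{R}(\mathbf{q}^*)}{\eta} - \sum_{t=1}^{T} I_t
\end{aligned}$$

The inequalities follow by the optimality of the corresponding variable that was changed and by the strong convexity of $\mathcal{R}(\cdot)$. The final vector $\mathbf{q}^*$ is an arbitrary vector in $\Delta(S_i)$. The base case of $T = 0$ follows trivially by $\mathcal{R}(\mathbf{f}) \ge 0$ for all $\mathbf{f}$. This concludes the inductive proof. ■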

Thus optimistic FTRL achieves the exact same form of regret presented in Theorem 17 for optimistic mirror descent. Hence, the equivalent versions of Theorem 18 and Corollary 6 also hold for the optimistic FTRL algorithm. In fact we are able to show slightly stronger bounds for optimistic FTRL, based on the following lemmas.

Lemma 20 (Stability). For the optimistic FTRL algorithm:

$$\|w^t - g^t\| \le \eta \|M^t - u^t\|_* \tag{14}$$

$$\|g^t - w^{t+1}\| \le \eta \|M^{t+1}\|_* \tag{15}$$


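Proof. Let $F_T(f) = \left\langle f, \sum_{t=1}^{T-1} u^t + M^T \right\rangle - \frac{1}{\eta}\mathcal{R}(f)$ and $G_T(f) = \left\langle f, \sum_{t=1}^{T} u^t \right\rangle - \frac{1}{\eta}\mathcal{R}(f)$, so that $w^T$ maximizes $F_T$ and $g^T$ maximizes $G_T$. Observe that $F_T(f) - G_T(f) = \langle f, M^T - u^T \rangle$ and $F_{T+1}(f) - G_T(f) = \langle f, M^{T+1} \rangle$.

Part 1. By the optimality of $w^T$ and $g^T$ and the strong convexity of $\mathcal{R}(\cdot)$:

$$F_T(w^T) \ge F_T(g^T) + \frac{1}{2\eta}\|w^T - g^T\|^2$$

$$G_T(g^T) \ge G_T(w^T) + \frac{1}{2\eta}\|w^T - g^T\|^2$$

Adding both inequalities and using the previous observations:

$$\frac{1}{\eta}\|w^T - g^T\|^2 \le \langle w^T - g^T, M^T - u^T \rangle \le \|w^T - g^T\| \, \|M^T - u^T\|_*$$

Dividing by $\|w^T - g^T\|$ gives the first inequality of the lemma.

Part 2. By the optimality of $g^T$ and $w^{T+1}$ and strong convexity:

$$F_{T+1}(w^{T+1}) \ge F_{T+1}(g^T) + \frac{1}{2\eta}\|w^{T+1} - g^T\|^2$$

$$G_T(g^T) \ge G_T(w^{T+1}) + \frac{1}{2\eta}\|w^{T+1} - g^T\|^2$$

Adding the inequalities:

$$\frac{1}{\eta}\|w^{T+1} - g^T\|^2 \le \langle w^{T+1} - g^T, M^{T+1} \rangle \le \|w^{T+1} - g^T\| \, \|M^{T+1}\|_*$$

Dividing by $\|w^{T+1} - g^T\|$ yields the second inequality of the lemma.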
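Lemma 20 can be sanity-checked numerically. The sketch below again assumes the entropic regularizer, which is 1-strongly convex with respect to $\|\cdot\|_1$ on the simplex, so the dual norm is $\|\cdot\|_\infty$. It verifies (14) and (15) along a random run with arbitrary hints $M^t$. The helper names and the random instance are hypothetical. A passing check illustrates the lemma but does not prove it.

```python
import math
import random


def softmax(scores, eta):
    top = max(scores)  # shift for numerical stability
    weights = [math.exp(eta * (x - top)) for x in scores]
    total = sum(weights)
    return [w / total for w in weights]


def l1_dist(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def linf(v):
    return max(abs(x) for x in v)


def check_stability(T=50, d=4, eta=0.3, seed=0):
    """Check (14) and (15) along one random run of entropic OFTRL with arbitrary hints."""
    rng = random.Random(seed)
    us = [[rng.random() for _ in range(d)] for _ in range(T)]      # u^1..u^T
    ms = [[rng.random() for _ in range(d)] for _ in range(T + 1)]  # M^1..M^{T+1}
    cum = [0.0] * d
    for t in range(T):
        w_t = softmax([c + h for c, h in zip(cum, ms[t])], eta)
        cum = [c + x for c, x in zip(cum, us[t])]
        g_t = softmax(cum, eta)
        w_next = softmax([c + h for c, h in zip(cum, ms[t + 1])], eta)
        # (14): ||w^t - g^t||_1 <= eta * ||M^t - u^t||_inf
        if l1_dist(w_t, g_t) > eta * linf([h - x for h, x in zip(ms[t], us[t])]) + 1e-12:
            return False
        # (15): ||g^t - w^{t+1}||_1 <= eta * ||M^{t+1}||_inf
        if l1_dist(g_t, w_next) > eta * linf(ms[t + 1]) + 1e-12:
            return False
    return True
```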


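Given Theorem 19 and Lemma 20, the proposition immediately follows since

$$\sum_{t=1}^T \langle w^* - w^t, u^t \rangle \le \frac{R}{\eta} + \eta\sum_{t=1}^T \|u^t - M^t\|_*^2 - \frac{1}{2\eta}\sum_{t=1}^T \left(\|w^t - g^t\|^2 + \|w^t - g^{t-1}\|^2\right)$$

Replacing $M^t$ with $u^{t-1}$ and using Inequality (10) yields the result.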

D Proof of Proposition 




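Proposition. The OFTRL algorithm using stepsize $\eta$ and $M^t = \frac{1}{H}\sum_{\tau=t-H}^{t-1} u^\tau$ satisfies the RVU property with constants $\alpha = R/\eta$, $\beta = \eta H^2$ and $\gamma = 1/(4\eta)$.

The proposition is equivalent to the following lemma, which we state and prove in this appendix.

Lemma 21. For the optimistic FTRL algorithm with $M^t = \frac{1}{H}\sum_{\tau=t-H}^{t-1} u^\tau$, the regret is upper bounded by:

$$\sum_{t=1}^T \langle w^* - w^t, u^t \rangle \le \frac{R}{\eta} + \eta H^2 \sum_{t=1}^T \|u^t - u^{t-1}\|_*^2 - \frac{1}{4\eta}\sum_{t=1}^T \|w^t - w^{t-1}\|^2 \tag{16}$$

where $R = \sup_f \mathcal{R}(f) - \inf_f \mathcal{R}(f)$. Thus we get $\sum_{i \in N} r_i(T) \le \frac{nR}{\eta} = 2n(n-1)HR$ for $\eta = \frac{1}{2H(n-1)}$.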


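Proof. Similar to Proposition 7, by Theorem 19, Lemma 20 and Inequality (10) we get:

$$\begin{aligned}
\sum_{t=1}^T \langle w^* - w^t, u^t \rangle &\le \frac{R}{\eta} + \eta\sum_{t=1}^T \|u^t - M^t\|_*^2 - \frac{1}{4\eta}\sum_{t=1}^T \|w^t - w^{t-1}\|^2 \\
&= \frac{R}{\eta} + \eta\sum_{t=1}^T \Big\|u^t - \frac{1}{H}\sum_{\tau=t-H}^{t-1} u^\tau\Big\|_*^2 - \frac{1}{4\eta}\sum_{t=1}^T \|w^t - w^{t-1}\|^2
\end{aligned}$$

By the triangle inequality:

$$\begin{aligned}
\Big\|u^t - \frac{1}{H}\sum_{\tau=t-H}^{t-1} u^\tau\Big\|_* &\le \frac{1}{H}\sum_{\tau=t-H}^{t-1}\|u^t - u^\tau\|_* \le \frac{1}{H}\sum_{\tau=t-H}^{t-1}\sum_{q=\tau}^{t-1}\|u^{q+1} - u^q\|_* \\
&= \frac{1}{H}\sum_{q=t-H}^{t-1}(q - t + H + 1)\|u^{q+1} - u^q\|_* \le \sum_{q=t-H}^{t-1}\|u^{q+1} - u^q\|_*
\end{aligned}$$

By Cauchy-Schwarz:

$$\Big(\sum_{q=t-H}^{t-1}\|u^{q+1} - u^q\|_*\Big)^2 \le H\sum_{q=t-H}^{t-1}\|u^{q+1} - u^q\|_*^2$$

Since each difference $\|u^{q+1} - u^q\|_*^2$ appears for at most $H$ values of $t$, we can derive that:

$$\sum_{t=1}^T \Big\|u^t - \frac{1}{H}\sum_{\tau=t-H}^{t-1} u^\tau\Big\|_*^2 \le H\sum_{t=1}^T\sum_{q=t-H}^{t-1}\|u^{q+1} - u^q\|_*^2 \le H^2\sum_{t=1}^T\|u^t - u^{t-1}\|_*^2$$

Substituting this into the previous bound yields (16).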
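The key step of Lemma 21 is $\sum_t \|u^t - M^t\|_*^2 \le H^2 \sum_t \|u^t - u^{t-1}\|_*^2$ for the window average $M^t$. It can be checked numerically as follows. The sketch assumes the $\ell_\infty$ dual norm and pads $u^\tau = u^0$ for $\tau < 0$. That padding is our convention, under which all differences before time 0 vanish. The function names are hypothetical.

```python
import random


def window_hint(us, t, H):
    """M^t = (1/H) * sum_{tau=t-H}^{t-1} u^tau, with the padding u^tau = u^0 for tau < 0."""
    acc = [0.0] * len(us[0])
    for tau in range(t - H, t):
        acc = [a + x for a, x in zip(acc, us[max(tau, 0)])]
    return [a / H for a in acc]


def sq_dual(v):
    # squared l_inf norm, assumed here as the dual norm
    return max(abs(x) for x in v) ** 2


def check_window_bound(T=40, d=3, H=4, seed=1):
    """Return (lhs, rhs) of sum_t ||u^t - M^t||_*^2 <= H^2 * sum_t ||u^t - u^{t-1}||_*^2."""
    rng = random.Random(seed)
    us = [[rng.random() for _ in range(d)] for _ in range(T + 1)]  # u^0, ..., u^T
    lhs = sum(sq_dual([a - b for a, b in zip(us[t], window_hint(us, t, H))])
              for t in range(1, T + 1))
    rhs = H ** 2 * sum(sq_dual([a - b for a, b in zip(us[t], us[t - 1])])
                       for t in range(1, T + 1))
    return lhs, rhs
```

For $H = 1$ the hint is $M^t = u^{t-1}$, and both sides coincide.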


E Proof of Proposition 10


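Proposition 10. The OFTRL algorithm using stepsize $\eta$ and $M^t = \frac{\sum_{\tau=0}^{t-1}\delta^{-\tau}u^\tau}{\sum_{\tau=0}^{t-1}\delta^{-\tau}}$ satisfies the RVU property with constants $\alpha = R/\eta$, $\beta = \eta/(1-\delta)^3$ and $\gamma = 1/(8\eta)$.

The proposition is equivalent to the following lemma, which we prove in this appendix.

Lemma 22. For the optimistic FTRL algorithm with $M^t = \frac{\sum_{\tau=0}^{t-1}\delta^{-\tau}u^\tau}{\sum_{\tau=0}^{t-1}\delta^{-\tau}}$ for some discount rate $\delta \in (0,1)$, the regret is upper bounded by:

$$\sum_{t=1}^T \langle w^* - w^t, u^t \rangle \le \frac{R}{\eta} + \frac{\eta}{(1-\delta)^3}\sum_{t=1}^T\|u^t - u^{t-1}\|_*^2 - \frac{1}{8\eta}\sum_{t=1}^T\|w^t - w^{t-1}\|^2 \tag{17}$$

where $R = \sup_f \mathcal{R}(f) - \inf_f \mathcal{R}(f)$. Thus we get $\sum_{i \in N} r_i(T) \le \frac{nR}{\eta} = \frac{2\sqrt{2}\,n(n-1)R}{(1-\delta)^{3/2}}$ for $\eta = \frac{(1-\delta)^{3/2}}{2\sqrt{2}\,(n-1)}$.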


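Proof. We show the lemma for the case of optimistic FTRL. The OMD case follows analogously. Similar to the previous lemma, the regret is upper bounded by:

$$\begin{aligned}
\sum_{t=1}^T \langle w^* - w^t, u^t \rangle &\le \frac{R}{\eta} + \eta\sum_{t=1}^T\|u^t - M^t\|_*^2 - \frac{1}{4\eta}\sum_{t=1}^T\|w^t - w^{t-1}\|^2 \\
&= \frac{R}{\eta} + \eta\sum_{t=1}^T\Big\|u^t - \frac{\sum_{\tau=0}^{t-1}\delta^{-\tau}u^\tau}{\sum_{\tau=0}^{t-1}\delta^{-\tau}}\Big\|_*^2 - \frac{1}{4\eta}\sum_{t=1}^T\|w^t - w^{t-1}\|^2
\end{aligned}$$

We will now show that:

$$\sum_{t=1}^T\Big\|u^t - \frac{\sum_{\tau=0}^{t-1}\delta^{-\tau}u^\tau}{\sum_{\tau=0}^{t-1}\delta^{-\tau}}\Big\|_*^2 \le \frac{1}{(1-\delta)^3}\sum_{t=1}^T\|u^t - u^{t-1}\|_*^2$$

This will conclude the proof, since $-\frac{1}{4\eta} \le -\frac{1}{8\eta}$.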
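First observe, by the triangle inequality:

$$\begin{aligned}
\Big\|u^t - \frac{\sum_{\tau=0}^{t-1}\delta^{-\tau}u^\tau}{\sum_{\tau=0}^{t-1}\delta^{-\tau}}\Big\|_*
&\le \frac{1}{\sum_{\tau=0}^{t-1}\delta^{-\tau}}\sum_{\tau=0}^{t-1}\delta^{-\tau}\|u^t - u^\tau\|_*
\le \frac{1}{\sum_{\tau=0}^{t-1}\delta^{-\tau}}\sum_{\tau=0}^{t-1}\delta^{-\tau}\sum_{q=\tau}^{t-1}\|u^{q+1} - u^q\|_* \\
&= \frac{1}{\sum_{\tau=0}^{t-1}\delta^{-\tau}}\sum_{q=0}^{t-1}\|u^{q+1} - u^q\|_*\sum_{\tau=0}^{q}\delta^{-\tau}
= \frac{1}{\sum_{\tau=0}^{t-1}\delta^{-\tau}}\sum_{q=0}^{t-1}\frac{1 - \delta^{-q-1}}{1 - \delta^{-1}}\|u^{q+1} - u^q\|_* \\
&\le \frac{1}{1-\delta}\cdot\frac{1}{\sum_{\tau=0}^{t-1}\delta^{-\tau}}\sum_{q=0}^{t-1}\delta^{-q}\|u^{q+1} - u^q\|_*
\end{aligned}$$

By Cauchy-Schwarz, writing $\delta^{-q} = \delta^{-q/2}\cdot\delta^{-q/2}$:

$$\begin{aligned}
\Big(\frac{1}{1-\delta}\cdot\frac{1}{\sum_{\tau=0}^{t-1}\delta^{-\tau}}\sum_{q=0}^{t-1}\delta^{-q}\|u^{q+1} - u^q\|_*\Big)^2
&\le \frac{1}{(1-\delta)^2\left(\sum_{\tau=0}^{t-1}\delta^{-\tau}\right)^2}\Big(\sum_{q=0}^{t-1}\delta^{-q}\Big)\Big(\sum_{q=0}^{t-1}\delta^{-q}\|u^{q+1} - u^q\|_*^2\Big) \\
&= \frac{1}{(1-\delta)^2\sum_{\tau=0}^{t-1}\delta^{-\tau}}\sum_{q=0}^{t-1}\delta^{-q}\|u^{q+1} - u^q\|_*^2
\le \frac{1}{\delta(1-\delta)^2}\sum_{q=0}^{t-1}\delta^{t-q}\|u^{q+1} - u^q\|_*^2
\end{aligned}$$

where the last inequality uses $\sum_{\tau=0}^{t-1}\delta^{-\tau} \ge \delta^{-(t-1)}$. Combining, we get:

$$\Big\|u^t - \frac{\sum_{\tau=0}^{t-1}\delta^{-\tau}u^\tau}{\sum_{\tau=0}^{t-1}\delta^{-\tau}}\Big\|_*^2 \le \frac{1}{\delta(1-\delta)^2}\sum_{q=0}^{t-1}\delta^{t-q}\|u^{q+1} - u^q\|_*^2$$

Summing over all $t$ and re-arranging, we get:

$$\begin{aligned}
\sum_{t=1}^T\Big\|u^t - \frac{\sum_{\tau=0}^{t-1}\delta^{-\tau}u^\tau}{\sum_{\tau=0}^{t-1}\delta^{-\tau}}\Big\|_*^2
&\le \frac{1}{\delta(1-\delta)^2}\sum_{t=1}^T\sum_{q=0}^{t-1}\delta^{t-q}\|u^{q+1} - u^q\|_*^2
= \frac{1}{\delta(1-\delta)^2}\sum_{q=0}^{T-1}\|u^{q+1} - u^q\|_*^2\sum_{t=q+1}^{T}\delta^{t-q} \\
&= \frac{1}{\delta(1-\delta)^2}\sum_{q=0}^{T-1}\|u^{q+1} - u^q\|_*^2\cdot\frac{\delta(1-\delta^{T-q})}{1-\delta}
\le \frac{1}{(1-\delta)^3}\sum_{q=0}^{T-1}\|u^{q+1} - u^q\|_*^2
\end{aligned}$$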
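A similar numerical check applies to the discounted hint of Lemma 22. The sketch rescales the weights $\delta^{-\tau}$ to $\delta^{t-1-\tau}$. This leaves $M^t$ unchanged and avoids overflow for long horizons. As before, the $\ell_\infty$ dual norm and the function names are our assumptions.

```python
import random


def discounted_hint(us, t, delta):
    """M^t = sum_{tau<t} delta^{-tau} u^tau / sum_{tau<t} delta^{-tau}.

    The weights are rescaled to delta^{t-1-tau}: the ratio is unchanged but nothing overflows.
    """
    num, den = [0.0] * len(us[0]), 0.0
    for tau in range(t):
        weight = delta ** (t - 1 - tau)
        num = [a + weight * x for a, x in zip(num, us[tau])]
        den += weight
    return [a / den for a in num]


def sq_dual(v):
    # squared l_inf norm, assumed here as the dual norm
    return max(abs(x) for x in v) ** 2


def check_discounted_bound(T=60, d=3, delta=0.8, seed=4):
    """Return (lhs, rhs) of sum_t ||u^t - M^t||_*^2 <= (1-delta)^{-3} sum_t ||u^t - u^{t-1}||_*^2."""
    rng = random.Random(seed)
    us = [[rng.random() for _ in range(d)] for _ in range(T + 1)]  # u^0, ..., u^T
    lhs = sum(sq_dual([a - b for a, b in zip(us[t], discounted_hint(us, t, delta))])
              for t in range(1, T + 1))
    rhs = sum(sq_dual([a - b for a, b in zip(us[t], us[t - 1])])
              for t in range(1, T + 1)) / (1 - delta) ** 3
    return lhs, rhs
```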


F Proof of Theorem 14


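Theorem 14. Algorithm $A'$ achieves regret at most the minimum of the following two terms:

$$\sum_{t=1}^T \langle w^* - w^t, u^t \rangle \le \log(T)\left(2 + \frac{\alpha}{\eta^*} + (2 + \eta^*\cdot\beta)\sum_{t=1}^T\|u^t - u^{t-1}\|_*^2\right) - \frac{\gamma}{\eta^*}\sum_{t=1}^T\|w^t - w^{t-1}\|^2$$

$$\sum_{t=1}^T \langle w^* - w^t, u^t \rangle \le \log(T)\left(1 + \frac{\alpha}{\eta^*} + (1 + \alpha\cdot\beta)\sqrt{2\sum_{t=1}^T\|u^t - u^{t-1}\|_*^2}\right)$$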


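Proof. We break the proof into the two corresponding parts.

First part. Consider a round $r$ and let $T_r$ be its final iteration. Also let $I_r = \sum_{t=T_{r-1}+1}^{T_r}\|u^t - u^{t-1}\|_*^2$. First observe that, by the definition of $B_r$:

$$\frac{1}{2}I_r \le B_r \le 2I_r + 1 \tag{18}$$

By the definition of $\eta_r$, we know that

$$\frac{1}{\eta_r} \le \frac{1}{\eta^*} + \frac{\sqrt{B_r}}{\alpha} \tag{19}$$

By the regret guarantee of algorithm $A(\eta_r)$, together with (19) and $\eta_r \le \eta^*$, we have that:

$$\begin{aligned}
\sum_{t=T_{r-1}+1}^{T_r}\langle w^* - w^t, u^t \rangle
&\le \frac{\alpha}{\eta_r} + \eta_r\cdot\beta\sum_{t=T_{r-1}+1}^{T_r}\|u^t - u^{t-1}\|_*^2 - \frac{\gamma}{\eta_r}\sum_{t=T_{r-1}+1}^{T_r}\|w^t - w^{t-1}\|^2 \\
&\le \frac{\alpha}{\eta^*} + \sqrt{B_r} + \eta^*\cdot\beta\sum_{t=T_{r-1}+1}^{T_r}\|u^t - u^{t-1}\|_*^2 - \frac{\gamma}{\eta^*}\sum_{t=T_{r-1}+1}^{T_r}\|w^t - w^{t-1}\|^2
\end{aligned}$$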
Since $\sqrt{B_r} \le B_r + 1 \le 2I_r + 2$:

$$\sum_{t=T_{r-1}+1}^{T_r}\langle w^* - w^t, u^t \rangle \le \frac{\alpha}{\eta^*} + 2 + (2 + \eta^*\cdot\beta)\sum_{t=T_{r-1}+1}^{T_r}\|u^t - u^{t-1}\|_*^2 - \frac{\gamma}{\eta^*}\sum_{t=T_{r-1}+1}^{T_r}\|w^t - w^{t-1}\|^2$$

Since at each round we are doubling the bound and since $\sum_{t=1}^T\|u^t - u^{t-1}\|_*^2 \le T$, there are at most $\log(T)$ rounds. Summing up the above inequality over the at most $\log(T)$ rounds yields the first claimed bound.




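Second part. Again consider any round $r$. By Equations (18) and (19), the fact that $\eta_r \le \alpha/\sqrt{B_r}$, and the regret of algorithm $A(\eta_r)$:

$$\begin{aligned}
\sum_{t=T_{r-1}+1}^{T_r}\langle w^* - w^t, u^t \rangle
&\le \frac{\alpha}{\eta^*} + \sqrt{B_r} + \eta_r\cdot\beta\sum_{t=T_{r-1}+1}^{T_r}\|u^t - u^{t-1}\|_*^2
= \frac{\alpha}{\eta^*} + \sqrt{B_r} + \eta_r\cdot\beta\cdot I_r \\
&\le \frac{\alpha}{\eta^*} + \sqrt{B_r} + \alpha\cdot\beta\cdot\frac{I_r}{\sqrt{B_r}}
\le \frac{\alpha}{\eta^*} + \sqrt{2I_r + 1} + \alpha\cdot\beta\cdot\sqrt{2I_r} \\
&\le \frac{\alpha}{\eta^*} + 1 + \sqrt{2I_r} + \alpha\cdot\beta\cdot\sqrt{2I_r}
\le \frac{\alpha}{\eta^*} + 1 + (1 + \alpha\cdot\beta)\sqrt{2\sum_{t=1}^T\|u^t - u^{t-1}\|_*^2}
\end{aligned}$$

Again, since the number of rounds is at most $\log(T)$, summing up the above bound over each round $r$ gives the second part of the theorem. ■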
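The round structure used in this proof can be sketched as follows. The exact definition of algorithm $A'$ is not reproduced here. The sketch adopts one possible restart rule as an assumption:

- A round ends once its accumulated variation $I_r$ reaches the budget $B_r$.
- The budget starts at $B_1 = 1$ and doubles, $B_{r+1} = 2B_r$.
- Each round runs $A$ with $\eta_r = \min\{\eta^*, \alpha/\sqrt{B_r}\}$.

The function name is hypothetical. The sketch counts the rounds for a given sequence of per-step variations and checks that their number stays at most $\lfloor\log_2(T+1)\rfloor + 1$.

```python
import math


def doubling_schedule(variations, eta_star, alpha):
    """Split per-step variations ||u^t - u^{t-1}||_*^2 (assumed in [0, 1]) into rounds.

    Hypothetical restart rule: a round ends once its accumulated variation I_r reaches
    the budget B_r, after which B_{r+1} = 2 * B_r; each round uses
    eta_r = min(eta_star, alpha / sqrt(B_r)).
    Returns (start, end, B_r, eta_r, I_r) tuples with 1-indexed, inclusive ends.
    """
    rounds = []
    budget, start, acc = 1.0, 1, 0.0
    for t, v in enumerate(variations, start=1):
        acc += v
        if acc >= budget:  # budget reached: close the round and double B
            rounds.append((start, t, budget, min(eta_star, alpha / math.sqrt(budget)), acc))
            budget, start, acc = 2 * budget, t + 1, 0.0
    if start <= len(variations):  # final round, possibly cut short by the horizon
        rounds.append((start, len(variations), budget,
                       min(eta_star, alpha / math.sqrt(budget)), acc))
    return rounds
```

With all per-step variations equal to 1 and $T = 1000$, the budgets are $1, 2, 4, \ldots, 512$, giving 10 rounds, the last of which is cut short by the horizon.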


G Proof of Corollary 16

Corollary 16. If $A$ satisfies the RVU($\rho$) property, and also $\|w^t - w^{t-1}\| \le \kappa\eta$, then $A'$ with $\eta^* = T^{-1/4}$ achieves regret $\tilde{O}(T^{1/4})$ if played against itself and $\tilde{O}(\sqrt{T})$ against any opponent.


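Proof. Observe that at any round of $A'$, algorithm $A$ is run with $\eta_r \le \eta^*$. Thus, by the property of algorithm $A$, we have at every iteration $\|w^t - w^{t-1}\| \le \kappa\eta^* = \kappa T^{-1/4}$. If all players use algorithm $A'$, then by the same reasoning used earlier to bound the variation of utilities in terms of the opponents' strategy changes, we know that:

$$\sum_{t=1}^T\|u_i^t - u_i^{t-1}\|_*^2 \le \sum_{t=1}^T\Big(\sum_{j \ne i}\|w_j^t - w_j^{t-1}\|\Big)^2 \le T(n-1)^2\kappa^2T^{-1/2} = (n-1)^2\kappa^2T^{1/2}$$

Hence, by the second bound of Theorem 14, the regret of each player is bounded by:

$$\begin{aligned}
\sum_{t=1}^T\langle w_i^* - w_i^t, u_i^t \rangle &\le \log(T)\left(1 + \frac{\alpha}{\eta^*} + (1 + \alpha\cdot\beta)\sqrt{2\sum_{t=1}^T\|u_i^t - u_i^{t-1}\|_*^2}\right) \\
&\le \log(T)\left(1 + \alpha T^{1/4} + (1 + \alpha\cdot\beta)\sqrt{2(n-1)^2\kappa^2T^{1/2}}\right) \\
&= \log(T)\left(1 + \alpha T^{1/4} + \sqrt{2}(1 + \alpha\cdot\beta)(n-1)\kappa T^{1/4}\right) = O(T^{1/4}\log T)
\end{aligned}$$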
H Fast convergence via a first order regret bound for cost-minimization

In this section, we show how a different regret bound can also lead to a fast convergence rate for a smooth game. For technical reasons we consider costs instead of utilities throughout this section. We use $C_i: S_1 \times \dots \times S_n \to [0,1]$ to denote the cost function of player $i$. Similarly to previous sections, we write $C(s) = \sum_{i \in N} C_i(s)$, $C(w) = \mathbb{E}_{s \sim w}[C(s)]$ and $\mathrm{Opt}' = \min_{s \in S_1 \times \dots \times S_n} C(s)$. A game is $(\lambda, \mu)$-smooth if there exists a strategy profile $s^*$ such that for any strategy profile $s$:

$$\sum_{i \in N} C_i(s_i^*, s_{-i}) \le \lambda\,\mathrm{Opt}' + \mu\,C(s) \tag{20}$$

Now suppose each player $i$ uses a no-regret algorithm to produce $w_i^t$ on each round and receives cost $c_{i,s}^t = \mathbb{E}_{s_{-i} \sim w_{-i}^t}[C_i(s, s_{-i})]$ for each strategy $s \in S_i$. Moreover, for any fixed strategy $s$, the no-regret algorithm ensures

$$\sum_{t=1}^T \langle w_i^t, c_i^t \rangle - \sum_{t=1}^T c_{i,s}^t \le A_1\sqrt{\sum_{t=1}^T c_{i,s}^t \log d} + A_2\log d \tag{21}$$

for some absolute constants $A_1$ and $A_2$, where $d$ denotes the number of strategies of each player. This form of first order bound can be achieved by a variety of algorithms, such as Hedge with appropriate learning rate tuning. Under this setup, we prove the following:

Theorem 23. If a game is $(\lambda, \mu)$-smooth and each player uses a no-regret algorithm with regret satisfying Eq. (21), then we have

$$\frac{1}{T}\sum_{t=1}^T C(w^t) \le \frac{\lambda(1+\mu)}{\mu(1-\mu)}\mathrm{Opt}' + \frac{A\,n\log d}{T(1-\mu)}$$

where $A = \frac{\mu A_1^2}{1-\mu} + 2A_2$.


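Proof. Using the regret bound and the Cauchy-Schwarz inequality, we have

$$\sum_{t=1}^T C(w^t) = \sum_{t=1}^T\sum_{i \in N}\langle w_i^t, c_i^t \rangle \le \sum_{i \in N}\sum_{t=1}^T c_{i,s_i^*}^t + A_1\sqrt{n\log d}\,\sqrt{\sum_{i \in N}\sum_{t=1}^T c_{i,s_i^*}^t} + A_2 n\log d \tag{22}$$

By the smoothness assumption, we have

$$\sum_{i \in N}\sum_{t=1}^T c_{i,s_i^*}^t = \sum_{t=1}^T\mathbb{E}_{s \sim w^t}\Big[\sum_{i \in N}C_i(s_i^*, s_{-i})\Big] \le \sum_{t=1}^T\left(\lambda\,\mathrm{Opt}' + \mu\,\mathbb{E}_{s \sim w^t}[C(s)]\right) = \lambda T\,\mathrm{Opt}' + \mu\sum_{t=1}^T C(w^t)$$

and therefore $\sum_{i \in N}\sum_{t=1}^T c_{i,s_i^*}^t \le x^2$, where we define $x = \sqrt{\lambda T\,\mathrm{Opt}' + \mu\sum_{t=1}^T C(w^t)}$. Now, applying this bound in Eq. (22), we continue with

$$\frac{1}{\mu}\left(x^2 - \lambda T\,\mathrm{Opt}'\right) \le x^2 + \left(A_1\sqrt{n\log d}\right)x + A_2 n\log d.$$

Rearranging gives a quadratic inequality $ax^2 + bx + c \le 0$ with

$$a = \frac{1-\mu}{\mu}, \quad b = -A_1\sqrt{n\log d}, \quad c = -\frac{\lambda}{\mu}T\,\mathrm{Opt}' - A_2 n\log d,$$

and solving for $x$ gives

$$x \le \frac{\mu}{2(1-\mu)}\left(-b + \sqrt{b^2 - 4ac}\right) \le \frac{\mu}{1-\mu}\sqrt{b^2 - 2ac}.$$

Finally, solving for $\sum_{t=1}^T C(w^t)$ (hidden in the definition of $x$) gives the bound stated in the theorem. Indeed, $\mu\sum_{t=1}^T C(w^t) = x^2 - \lambda T\,\mathrm{Opt}' \le \frac{\mu^2}{(1-\mu)^2}(b^2 - 2ac) - \lambda T\,\mathrm{Opt}'$, and dividing by $\mu T$ yields the claim.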
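The last algebraic step of Theorem 23 goes from the quadratic inequality in $x$ to the bound on the average social cost. It can be checked numerically. The sketch below, with hypothetical function names, computes the largest average cost $\frac{1}{T}\sum_t C(w^t)$ allowed by $ax^2 + bx + c \le 0$. It then compares this with the right-hand side of the theorem over an arbitrary grid of parameters.

```python
import math


def theorem23_bound(lam, mu, opt, n, d, T, A1, A2):
    """Right-hand side of Theorem 23 for the time-averaged social cost."""
    A = mu * A1 ** 2 / (1 - mu) + 2 * A2
    return lam * (1 + mu) / (mu * (1 - mu)) * opt + A * n * math.log(d) / (T * (1 - mu))


def worst_case_average_cost(lam, mu, opt, n, d, T, A1, A2):
    """Largest average cost compatible with a x^2 + b x + c <= 0."""
    a = (1 - mu) / mu
    b = -A1 * math.sqrt(n * math.log(d))
    c = -lam / mu * T * opt - A2 * n * math.log(d)
    x_max = (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)  # larger root
    total_cost = (x_max ** 2 - lam * T * opt) / mu      # invert x^2 = lam*T*Opt' + mu*sum C
    return total_cost / T
```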
Note that the price of total anarchy is larger than the one achieved by previous analysis by a multiplicative factor of $1 + 1/\mu$, but the convergence rate is much faster ($n$ times faster compared to optimistic mirror descent or optimistic FTRL).

I Extension to continuous strategy space games 

In this section we extend our results to continuous strategy space games such as, for instance, splittable selfish routing games (see e.g. [?]). The price of anarchy of these games has been well studied, and they are well motivated by internet routing. In these games we consider dynamics where the players simply observe the past play of their opponents and not the expected past play. The players do not use mixed strategies, but simply run online convex optimization algorithms on their continuous strategy spaces. Such learning in continuous games has also been studied in more restrictive settings in [?].

In this setting, each player $i$ has a strategy space $S_i$, which is a closed convex set in $\mathbb{R}^d$. We will denote with $w_i \in S_i$ a strategy of player $i$.³ Given a profile of strategies $w = (w_1, \ldots, w_n)$, each player incurs a cost $C_i(w)$ (equivalently, a utility $U_i(w)$).

We make the following two assumptions on the costs: 

1. (Convex in player strategy) For each player $i$ and for each profile of opponent strategies $w_{-i}$, the function $C_i(\cdot, w_{-i})$ is convex in $w_i$.

2. (Lipschitz gradient) For each player $i$, the function $\delta_i(w) = \nabla_i C_i(w)$⁴ is $L$-Lipschitz continuous with respect to the $\|\cdot\|_1$ norm when the profile $w$ is viewed as a vector in the $n \cdot d$ dimensional space, i.e.:

$$\|\delta_i(w) - \delta_i(y)\|_* \le L\sum_{j \in [n]}\|w_j - y_j\|_1 \tag{23}$$


Observe that a sufficient condition for Property (2) is that the function $\delta_i(w)$ is coordinate-wise (that is, player-wise) $L$-Lipschitz with respect to the $\|\cdot\|$ norm.

Lemma 24. If for any $j$:

$$\|\delta_i(w) - \delta_i(y_j, w_{-j})\|_* \le L\|w_j - y_j\| \tag{24}$$

then $\delta_i(\cdot)$ satisfies Property (2).

Proof. For any two vectors $w$ and $y$, think of switching from one to the other by sequentially switching each player $j$ from his strategy $w_j$ to $y_j$, keeping the remaining players fixed, in some pre-fixed player order. By the triangle inequality, $\|\delta_i(w) - \delta_i(y)\|_*$ is upper bounded by the sum of the differences caused by these sequential switches. By the property assumed in the lemma, the difference caused by each unilateral switch of player $j$ is in turn upper bounded by $L\|w_j - y_j\|$. The lemma then follows. ■


Example (Connection to discrete games). We can view discrete action games as a special case of the latter setting, by re-naming mixed strategies in the discrete game to pure strategies in the continuous space game. Under this mapping, the continuous strategy space is the simplex in $\mathbb{R}^d$, where $d$ is the number of pure strategies of the discrete game. Moreover, the costs $C_i(w)$ (equivalently, utilities) are multi-linear, i.e. $C_i(w) = \mathbb{E}_{s \sim w}[C_i(s)]$. Obviously, these multi-linear costs satisfy assumption 1, i.e. they are convex (in fact linear) in a player's strategy.

The second assumption is also satisfied, albeit with a slightly more involved proof, which follows the same total variation argument used earlier for discrete games. Basically, observe that

$$\delta_{i,s_i}(w) = \sum_{s_{-i}} C_i(s_i, s_{-i}) \prod_{j \ne i} w_{j,s_j} \tag{25}$$

³We will use $w_i$ instead of $s_i$ for a pure strategy, since pure strategies of the continuous game will be treated essentially as equivalent to mixed strategies in the discrete game described in the previous sections.

⁴With $\nabla_i C_i(w)$ we denote the gradient of the function with respect to the strategy of player $i$, fixing the strategies of the other players. Equivalently, for each fixed $w_{-i}$ it is the gradient of the function $C_i(\cdot, w_{-i})$.
Assuming $C_i(s) \le 1$:

$$\left|\delta_{i,s_i}(w) - \delta_{i,s_i}(y)\right| \le \sum_{s_{-i}}\Big|\prod_{j \ne i} w_{j,s_j} - \prod_{j \ne i} y_{j,s_j}\Big| \le \sum_{j \ne i}\|w_j - y_j\|_1 \tag{26}$$

where the last inequality holds by the properties of total variation distance.
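The total variation step in (26) states $\sum_{s_{-i}}|\prod_{j \ne i} w_{j,s_j} - \prod_{j \ne i} y_{j,s_j}| \le \sum_{j \ne i}\|w_j - y_j\|_1$. The sketch below illustrates it numerically by enumerating all opponent profiles of a small random instance. The helper names and the random mixed strategies are hypothetical.

```python
import itertools
import random


def prob(dists, profile):
    """Probability of a pure profile under independent mixed strategies."""
    p = 1.0
    for dist, s in zip(dists, profile):
        p *= dist[s]
    return p


def tv_bound_holds(n_opp=3, d=3, seed=2):
    """Check sum_{s_-i} |prod_j w_{j,s_j} - prod_j y_{j,s_j}| <= sum_j ||w_j - y_j||_1."""
    rng = random.Random(seed)

    def random_dist():
        raw = [rng.random() for _ in range(d)]
        total = sum(raw)
        return [x / total for x in raw]

    w = [random_dist() for _ in range(n_opp)]
    y = [random_dist() for _ in range(n_opp)]
    lhs = sum(abs(prob(w, s) - prob(y, s))
              for s in itertools.product(range(d), repeat=n_opp))
    rhs = sum(sum(abs(a - b) for a, b in zip(wj, yj)) for wj, yj in zip(w, y))
    return lhs <= rhs + 1e-12
```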


Example (Splittable congestion games). In this game each player $i$ has an amount of flow $f_i \le B$ that he wants to route from a source $s_i$ to a sink $t_i$ in an undirected graph $G = (V, E)$. Each edge $e \in E$ is associated with a latency function $\ell_e(f_e)$, which maps the amount of flow $f_e$ passing through the edge to a latency. We will assume that latency functions are convex, increasing and twice differentiable. We will also assume that both $\ell_e(\cdot)$ and $\ell_e'(\cdot)$ are $K$-Lipschitz functions of the flow. We denote with $\mathcal{P}_i$ the set of $(s_i, t_i)$ paths in the graph. Then the set of feasible strategies for each player consists of all possible ways of splitting his flow $f_i$ onto the paths in $\mathcal{P}_i$. Denote with $w_{i,p}$ the amount of flow player $i$ routes on path $p \in \mathcal{P}_i$. The strategy space is then:

$$S_i = \Big\{ w_i \in \mathbb{R}_{+}^{|\mathcal{P}_i|} : \sum_{p \in \mathcal{P}_i} w_{i,p} = f_i \Big\} \tag{27}$$

The latter is obviously a closed convex set in $\mathbb{R}^{|\mathcal{P}_i|}$.

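For an edge $e$, let $f_{i,e}(w_i) = \sum_{p \in \mathcal{P}_i : e \in p} w_{i,p}$ be the flow on edge $e$ caused by player $i$, and let $f_e(w) = \sum_{i \in [n]} f_{i,e}(w_i)$ be the total flow on edge $e$. Then the cost of a player is:

$$C_i(w) = \sum_{e \in E} f_{i,e}(w_i) \cdot \ell_e(f_e(w)) \tag{28}$$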


First observe that the functions $C_i(\cdot)$ are convex with respect to a player's strategy $w_i$. Since the cost is a sum across edges, it suffices to show convexity locally at each edge. The latency function on an edge is a convex, increasing function of the total flow. Hence, for any fixed $b \ge 0$, $x\,\ell_e(x + b)$ is also a convex function of $x \ge 0$, since its second derivative is $2\ell_e'(x + b) + x\,\ell_e''(x + b) \ge 0$. Now observe that the cost from each edge is of the form $f_{i,e}(w_i)\,\ell_e(f_{i,e}(w_i) + b)$, where $b$ is the flow of the other players on $e$. This is convex with respect to $f_{i,e}(w_i)$. In turn, $f_{i,e}(\cdot)$ is a linear function of $w_i$. Thus the whole cost function is convex in $w_i$.

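Next, we need to show that the second condition on the cost functions is satisfied for some Lipschitzness factor $L$. This will be a consequence of the $K$-Lipschitzness of the latency functions. Denote with $\hat{\ell}_e(w) = \ell_e(f_e(w)) + f_{i,e}(w_i)\cdot\ell_e'(f_e(w))$. Then, observe that:

$$\delta_{i,p}(w) = \frac{\partial C_i(w)}{\partial w_{i,p}} = \sum_{e \in p}\left(\ell_e(f_e(w)) + f_{i,e}(w_i)\cdot\ell_e'(f_e(w))\right) = \sum_{e \in p}\hat{\ell}_e(w) \tag{29}$$

Since both $\ell_e(\cdot)$ and $\ell_e'(\cdot)$ are $K$-Lipschitz and $f_{i,e}(w_i) \le B$, we have that:

$$\begin{aligned}
|\delta_{i,p}(w) - \delta_{i,p}(y)| &\le \sum_{e \in p}|\hat{\ell}_e(w) - \hat{\ell}_e(y)| \\
&\le \sum_{e \in p}|\ell_e(f_e(w)) - \ell_e(f_e(y))| + B\sum_{e \in p}|\ell_e'(f_e(w)) - \ell_e'(f_e(y))| \\
&\le K(1+B)\sum_{e \in p}|f_e(w) - f_e(y)| \le K(1+B)\sum_{e \in p}\sum_{j \in [n]}|f_{j,e}(w_j) - f_{j,e}(y_j)| \\
&\le K(1+B)\sum_{e \in p}\sum_{j \in [n]}\sum_{q \in \mathcal{P}_j : e \in q}|w_{j,q} - y_{j,q}| \\
&= K(1+B)\sum_{j \in [n]}\sum_{q \in \mathcal{P}_j}\sum_{e \in p \cap q}|w_{j,q} - y_{j,q}| \le K(1+B)\,m\sum_{j \in [n]}\sum_{q \in \mathcal{P}_j}|w_{j,q} - y_{j,q}| \\
&= K(1+B)\,m\sum_{j \in [n]}\|w_j - y_j\|_1
\end{aligned}$$

where $m = |E|$ is the number of edges. Thus the second condition is satisfied with $L = K(1+B)m$.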
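The sketch below checks the gradient formula (29) on a hypothetical toy instance. It evaluates $C_i$ from (28) and $\delta_{i,p}$ and compares $\delta_{i,p}$ with a central finite difference of $C_i$. The instance has:

- three edges with quadratic latencies $\ell_e(x) = a_e x^2 + b_e x$;
- two players with two paths each.

The edge lists are not checked to be actual $s_i$-$t_i$ paths, and all names and coefficients are illustrative.

```python
from typing import Dict, List, Tuple

# Hypothetical toy instance: latency l_e(x) = a_e * x**2 + b_e * x (convex, increasing for x >= 0).
COEFFS: Dict[str, Tuple[float, float]] = {"e1": (1.0, 0.5), "e2": (0.5, 1.0), "e3": (2.0, 0.1)}


def latency(e, x):
    a, b = COEFFS[e]
    return a * x * x + b * x


def latency_prime(e, x):
    a, b = COEFFS[e]
    return 2 * a * x + b


def edge_flows(paths, w):
    """paths[i][p] lists the edges of path p of player i; w[i][p] is the flow routed on it.

    Returns (per-player edge flows f_{i,e}, total edge flows f_e).
    """
    per_player = []
    total = {e: 0.0 for e in COEFFS}
    for paths_i, w_i in zip(paths, w):
        f_i = {e: 0.0 for e in COEFFS}
        for edges, flow in zip(paths_i, w_i):
            for e in edges:
                f_i[e] += flow
                total[e] += flow
        per_player.append(f_i)
    return per_player, total


def cost(paths, w, i):
    """C_i(w) = sum_e f_{i,e}(w_i) * l_e(f_e(w)), Eq. (28)."""
    per_player, total = edge_flows(paths, w)
    return sum(per_player[i][e] * latency(e, total[e]) for e in COEFFS)


def gradient(paths, w, i) -> List[float]:
    """delta_{i,p}(w) = sum_{e in p} (l_e(f_e) + f_{i,e} * l_e'(f_e)), Eq. (29)."""
    per_player, total = edge_flows(paths, w)
    return [sum(latency(e, total[e]) + per_player[i][e] * latency_prime(e, total[e])
                for e in edges)
            for edges in paths[i]]
```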


For these games we will assume that the players perform some form of regularized learning, using the gradients of their costs as proxies for the costs themselves. For fast convergence we require that the algorithms they use satisfy the following property, which generalizes the corresponding result for discrete games.

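Theorem 25. Consider a repeated continuous strategy space game where the cost functions satisfy properties 1 and 2. Suppose that the algorithm of each player $i$ satisfies the property that for any $w_i^* \in S_i$:

$$\sum_{t=1}^T\left(C_i(w^t) - C_i(w_i^*, w_{-i}^t)\right) \le \alpha + \beta\sum_{t=1}^T\|\delta_i(w^t) - \delta_i(w^{t-1})\|_*^2 - \gamma\sum_{t=1}^T\|w_i^t - w_i^{t-1}\|^2 \tag{30}$$

for some $\alpha > 0$ and $0 < \beta \le \frac{\gamma}{L^2 n^2}$, where $\|\cdot\|$ denotes the $\|\cdot\|_1$ norm. Then:

$$\sum_{i \in N} r_i(T) \le n \cdot \alpha = O(1) \tag{31}$$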


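Proof. By property 2, we have that:

$$\sum_{t=1}^T\|\delta_i(w^t) - \delta_i(w^{t-1})\|_*^2 \le \sum_{t=1}^T\Big(L\sum_{j \in [n]}\|w_j^t - w_j^{t-1}\|\Big)^2 \le L^2 n\sum_{t=1}^T\sum_{j \in [n]}\|w_j^t - w_j^{t-1}\|^2$$

Summing up the regret inequality for each player and using the above bound, we get:

$$\sum_{i \in N} r_i(T) \le n\cdot\alpha + \beta L^2 n^2\sum_{t=1}^T\sum_{i \in [n]}\|w_i^t - w_i^{t-1}\|^2 - \gamma\sum_{t=1}^T\sum_{i \in [n]}\|w_i^t - w_i^{t-1}\|^2 \tag{32}$$

If $\beta L^2 n^2 \le \gamma$, the theorem follows. ■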


All the algorithms described in the previous sections can be adapted to satisfy the bound required by the theorem above. The adaptation simply uses the gradient of the cost as a proxy for the actual cost, and the bound then follows by standard arguments. Suppose, for instance, that players follow this adaptation of the regularized leader algorithm:

$$w_i^T = \arg\min_{w \in S_i}\left\langle w, \sum_{t=1}^{T-1}\delta_i(w^t) + \delta_i(w^{T-1})\right\rangle + \frac{\mathcal{R}(w)}{\eta} \tag{33}$$

Then, by the RVU guarantee of optimistic FTRL, their regret satisfies the conditions of the theorem above with $\alpha = \frac{R}{\eta}$, $\beta = \eta$ and $\gamma = \frac{1}{4\eta}$, where $R = \max_{w_i \in S_i}\mathcal{R}(w_i)$. We need $\eta L^2 n^2 \le \frac{1}{4\eta}$, or equivalently $\eta \le \frac{1}{2Ln}$. Thus, for $\eta = \frac{1}{2Ln}$, if all players use this algorithm, the total regret is at most

$$n\cdot\frac{R}{\eta} = 2Ln^2R.$$

Example (Splittable congestion games). Consider congestion games with splittable flow, where all the latencies and their derivatives are $K$-Lipschitz and the flow of each player is at most $B$. In that setting, suppose that we use the entropic regularizer. Then for each player $i$, $R \le B\cdot\log(|\mathcal{P}_i|)$. The number of possible $(s,t)$ paths is at most $2^m$, which yields $R \le B\cdot m$. Hence, by using the linearized follow the regularized leader, the total regret is at most $2Ln^2R \le 2K(B+1)Bm^2n^2$. ■


J $\Omega(\sqrt{T})$ Lower Bounds on Regret for other Dynamics

We consider a two-player zero-sum game which can be described by a utility matrix $A$. Assume the row player uses MWU with a fixed learning rate $\eta$, and the column player plays the best response, that is, a pure strategy that minimizes the row player's expected utility for the current round. The following theorem states that, no matter how $\eta$ is set, there is always a game $A$ in which the regret of the row player is at least $\Omega(\sqrt{T})$.
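Theorem 26. In the setting described above, let $r(T)$ and $r'(T)$ be the regret of the row player after $T$ rounds for the games $A = \begin{pmatrix}1 & 0\\ 0 & 1\end{pmatrix}$ and $A' = \begin{pmatrix}1 & 1\\ 0 & 0\end{pmatrix}$, respectively. Then $\max\{r(T), r'(T)\} \ge \Omega(\sqrt{T})$.

Proof. For game $A$, according to the setup, one can verify the following. On each odd round $t$, the row player plays the uniform distribution and receives utility $\frac{1}{2}$. On the next round $t+1$, the row player puts slightly more weight on one row, and the column player picks the column that has 0 utility for that row. Specifically, the expected utility of the row player on each even round is $\frac{1}{e^\eta + 1}$. Therefore, assuming $T$ is even for simplicity, the regret is

$$r(T) = \frac{T}{2} - \frac{T}{2}\left(\frac{1}{2} + \frac{1}{e^\eta + 1}\right) = \frac{T}{4}\cdot\frac{e^\eta - 1}{e^\eta + 1}.$$

For game $A'$, the expected utility of the row player on round $t$ is $\frac{e^{\eta(t-1)}}{e^{\eta(t-1)} + 1}$, and thus the regret is

$$r'(T) = T - \sum_{t=1}^T\frac{e^{\eta(t-1)}}{e^{\eta(t-1)} + 1} = \sum_{t=1}^T\frac{1}{e^{\eta(t-1)} + 1} \ge \sum_{t=1}^T\frac{1}{2e^{\eta(t-1)}} = \frac{1 - e^{-\eta T}}{2(1 - e^{-\eta})}.$$

Now, if $\eta \ge 1$, then $r(T) \ge \frac{T}{4}\cdot\frac{e - 1}{e + 1} = \Omega(T)$. If $\eta \le \frac{1}{T}$, then $1 - e^{-\eta T} \ge (1 - e^{-1})\eta T$ and $1 - e^{-\eta} \le \eta$, so $r'(T) \ge \frac{(1 - e^{-1})T}{2} = \Omega(T)$. Finally, when $\frac{1}{T} < \eta < 1$, we have

$$r(T) + r'(T) \ge \frac{T}{4}\cdot\frac{e^\eta - 1}{e + 1} + \frac{1 - e^{-1}}{2(1 - e^{-\eta})} \ge \frac{T}{4}\cdot\frac{e^\eta - 1}{e + 1} + \frac{1 - e^{-1}}{2(e^\eta - 1)} \ge \sqrt{\frac{T(1 - e^{-1})}{2(e + 1)}} = \Omega(\sqrt{T}).$$

To sum up, we have $\max\{r(T), r'(T)\} \ge \Omega(\sqrt{T})$. ■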
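The dynamics in the proof of Theorem 26 can be simulated directly. The sketch makes these assumptions:

- $A$ is the $2 \times 2$ identity and $A' = \begin{pmatrix}1&1\\0&0\end{pmatrix}$, chosen to reproduce the utilities described in the proof.
- Ties in the column player's best response are broken toward the lowest index.

It compares the simulated regrets with the closed forms $r(T) = \frac{T}{4}\cdot\frac{e^\eta - 1}{e^\eta + 1}$ and $r'(T) = \sum_{t=1}^T \frac{1}{e^{\eta(t-1)} + 1}$. The function name is hypothetical.

```python
import math


def mwu_vs_best_response(A, eta, T):
    """Row player runs MWU with fixed eta; column player best-responds (ties -> lowest index).

    Returns the row player's regret after T rounds.
    """
    rows, cols = len(A), len(A[0])
    cum = [0.0] * rows  # cumulative utility of each pure row strategy
    earned = 0.0        # cumulative expected utility of the MWU player
    for _ in range(T):
        top = max(cum)
        weights = [math.exp(eta * (c - top)) for c in cum]
        z = sum(weights)
        p = [x / z for x in weights]
        # expected utility of the row player against each pure column
        col_vals = [sum(p[r] * A[r][c] for r in range(rows)) for c in range(cols)]
        j = min(range(cols), key=lambda c: col_vals[c])
        earned += col_vals[j]
        cum = [cum[r] + A[r][j] for r in range(rows)]
    return max(cum) - earned
```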